# 1

06-07-2010 10:09 PM
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
It might not tell us anything, but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
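For anyone who wants to check the numbers in the Full GC lines quoted above, here is a minimal Python sketch (not part of the original thread). It parses the five entries from Todd's grep output and reports the occupancy after each full collection and the spacing between collections. The regex, the FullGC record, and the helper names are illustrative assumptions, written only against the log format shown in this thread.

```python
import re
from typing import List, NamedTuple

# The five Full GC entries quoted in the thread, re-joined onto single lines.
SAMPLE_LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""


class FullGC(NamedTuple):
    uptime_s: float   # JVM uptime stamp printed before the event
    before_mb: int    # heap occupancy before the collection
    after_mb: int     # heap occupancy after the collection
    heap_mb: int      # heap capacity shown in parentheses
    pause_s: float    # reported duration of the collection


# \s+ before the duration tolerates the line wrapping seen in the email.
FULL_GC_RE = re.compile(
    r"(\d+\.\d+): \[Full GC (\d+)M->(\d+)M\((\d+)M\),\s+(\d+\.\d+) secs\]"
)


def parse_full_gcs(text: str) -> List[FullGC]:
    """Extract every Full GC entry found in the given log text."""
    return [
        FullGC(float(m[1]), int(m[2]), int(m[3]), int(m[4]), float(m[5]))
        for m in FULL_GC_RE.finditer(text)
    ]


def intervals(events: List[FullGC]) -> List[float]:
    """Seconds between the uptime stamps of consecutive full collections."""
    return [b.uptime_s - a.uptime_s for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    events = parse_full_gcs(SAMPLE_LOG)
    for e in events:
        pct = 100.0 * e.after_mb / e.heap_mb
        print(f"{e.uptime_s:10.3f}s  pause {e.pause_s:6.2f}s  after-GC occupancy {pct:4.1f}%")
    for gap in intervals(events):
        print(f"gap between full GCs: {gap / 60.0:.1f} min")
```

On the quoted lines, occupancy after each full GC comes out at roughly 61-63% of the 8000M heap, which matches the ~60% live set Todd describes, and the full collections are roughly 5 to 9 minutes apart.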
# 2

06-07-2010 10:12 PM
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs look as if the heap simply got too
full, which would mean that incremental collection is not keeping
up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
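One rough way to put a number on "not keeping up" from the log lines already posted is sketched below in Python (this is an illustration added here, not something from the thread). It estimates how fast the heap grows, net of whatever the incremental pauses reclaim, between the end of one full GC and the start of the next. It assumes that the uptime stamp marks the start of each pause and that occupancy rises steadily in between; the tuple layout and the function name are made up for this example.

```python
from typing import List, Tuple

# (uptime_s, before_mb, after_mb, pause_s) for the five Full GC entries
# quoted in the thread.
FULL_GCS: List[Tuple[float, int, int, float]] = [
    (52521.446, 7934, 4865, 9.8907800),
    (52820.032, 7930, 4964, 9.9025520),
    (53227.102, 7934, 4882, 10.1232190),
    (53661.279, 7938, 5002, 10.4997760),
    (54211.761, 7938, 4962, 11.0497380),
]


def net_growth_rates(events: List[Tuple[float, int, int, float]]) -> List[float]:
    """Net heap growth in MB/s between consecutive full collections.

    Assumption: the uptime stamp is the start of the pause, so the
    application resumes at uptime_s + pause_s.
    """
    rates = []
    for prev, nxt in zip(events, events[1:]):
        prev_uptime, _, prev_after, prev_pause = prev
        next_uptime, next_before, _, _ = nxt
        window_s = next_uptime - (prev_uptime + prev_pause)
        growth_mb = next_before - prev_after
        rates.append(growth_mb / window_s)
    return rates


if __name__ == "__main__":
    for rate in net_growth_rates(FULL_GCS):
        print(f"net heap growth: {rate:.1f} MB/s")
```

This is net growth, not the allocation rate: it is what is left over after the young and partial pauses have done their work. For the quoted entries it comes out somewhere between about 5 and 11 MB/s, which is the amount by which incremental collection falls short before the heap fills again.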
# 3

06-07-2010 10:40 PM
I also work with Todd on this system (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane. We'd like to keep some reasonable amount of
memory efficiency here. The more RAM we can devote to this block cache,
the better our systems perform. Also, a lot of the settings are
self-tuning, so if we raise Xmx, the size of the block cache scales
up as well.
-ryan
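The self-tuning point matters for the heap-doubling experiment, and a little arithmetic shows why. The Python sketch below is an illustration only: it takes the ~60% live-set figure from Todd's message and assumes, as a simplification, that the whole live set scales with Xmx. The function names are hypothetical.

```python
def scaled_live_mb(xmx_mb: int, live_percent: int = 60) -> int:
    """Live set in MB if it scales with the heap (60% per Todd's message)."""
    return xmx_mb * live_percent // 100


def headroom_mb(xmx_mb: int, live_mb: int) -> int:
    """Heap left over for garbage before the heap is completely full."""
    return xmx_mb - live_mb


if __name__ == "__main__":
    for xmx in (8000, 16000):
        live = scaled_live_mb(xmx)
        free = headroom_mb(xmx, live)
        print(f"Xmx={xmx}M self-tuned: live={live}M headroom={free}M "
              f"({100 * free // xmx}% of heap)")
    # Doubling the heap while holding the live set at its current size:
    fixed_live = scaled_live_mb(8000)
    print(f"Xmx=16000M fixed cache: headroom={headroom_mb(16000, fixed_live)}M")
```

With the cache scaling along with Xmx, the free fraction stays at 40% (3200M at 8000M, 6400M at 16000M). Only if the live set is held at its current 4800M does a 16000M heap leave 11200M of headroom, so a doubling test says most about the collector when the cache size is kept constant.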
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
# 4

07-07-2010 12:57 AM
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs almost look like the heap got too full,
which would mean that incremental collection is not keeping up with
the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
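For anyone following along, ramki's reading that the heap is simply filling up can be checked against the Full GC lines Todd posted. Below is a minimal Python sketch, not anything from the thread itself: the function name parse_full_gcs and the dictionary keys are made up for illustration, and the regex assumes the log format is exactly as in the grep output quoted above.

```python
import re

# Full GC entries as quoted in the original post (line wrapping preserved).
SAMPLE = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
11.0497380 secs]
"""

# uptime, heap before/after/capacity in MB, and pause length in seconds.
# \s+ after the comma lets the pattern span the wrapped "secs]" part.
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<cap>\d+)M\),\s+"
    r"(?P<secs>\d+\.\d+) secs\]"
)


def parse_full_gcs(text):
    """Return one dict per Full GC entry, in log order (hypothetical helper)."""
    events = []
    prev_uptime = None
    for m in FULL_GC.finditer(text):
        uptime = float(m["uptime"])
        before, after, cap = int(m["before"]), int(m["after"]), int(m["cap"])
        events.append({
            "uptime_s": uptime,
            "occupancy_before": before / cap,
            "occupancy_after": after / cap,
            "reclaimed_mb": before - after,
            "pause_s": float(m["secs"]),
            # Seconds of JVM uptime since the previous Full GC, if any.
            "gap_s": None if prev_uptime is None else uptime - prev_uptime,
        })
        prev_uptime = uptime
    return events


for e in parse_full_gcs(SAMPLE):
    gap = "n/a" if e["gap_s"] is None else f"{e['gap_s']:.0f}s"
    print(f"{e['occupancy_before']:.1%} -> {e['occupancy_after']:.1%}, "
          f"reclaimed {e['reclaimed_mb']}M in {e['pause_s']:.1f}s, "
          f"gap since previous: {gap}")
```

On these five entries, every full collection starts above 99% occupancy and ends at roughly 61 to 63%, which lines up with the ~60% live set Todd describes.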
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some reasonable amount of
memory efficiency here. The thing is, the more we can get out of our
RAM for this block cache, the better our systems perform. Also,
a lot of the settings are self-tuning, so if we up the Xmx, the size of
the block cache is scaled as well.
-ryan
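To put numbers on Ryan's point: Todd's original post says the application keeps its live set at around 60% of the max heap, and Ryan notes the block cache scales with Xmx. A rough sketch of that arithmetic follows; the helper name heap_budget and the fixed 60% figure are assumptions for illustration, and real HBase sizing has more knobs than this.

```python
def heap_budget(xmx_mb, live_percent=60):
    """Split a max heap size into live data and free headroom.

    live_percent=60 is taken from the "around 60% of the max heap"
    figure in the original post; treating it as a fixed ratio is a
    simplification.
    """
    live_mb = xmx_mb * live_percent // 100
    return {"xmx_mb": xmx_mb, "live_mb": live_mb, "headroom_mb": xmx_mb - live_mb}


for xmx in (8000, 16000):
    b = heap_budget(xmx)
    print(f"Xmx={b['xmx_mb']}M live={b['live_mb']}M headroom={b['headroom_mb']}M")
```

Under that assumption, going from 8000M to 16000M raises the live set from 4800M to 9600M, so the free share of the heap stays at about 40% rather than growing to 70%.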
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) Do you enable class unloading with CMS? (I do not see that
in your option list below, but wondered.)
(2) Does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand it, you get to more quickly with G1 than you did
with CMS).
-- ramki
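One way to check the "reclaims less and less" observation against a full log is to pull the heap transition out of every incremental pause and watch how much each one frees. This is a hedged sketch: reclaimed_per_pause and trend are hypothetical helpers, and the pattern assumes the "[ 7677M->7636M(8000M)]" line format shown in Todd's quoted pause. Only that one pause is reproduced in the thread, so the example runs on it alone.

```python
import re

# Matches the heap-transition line printed at the end of each G1 pause,
# e.g. "[ 7677M->7636M(8000M)]". Full GC lines have a different shape
# ("[Full GC 7934M->4865M(8000M), ...") and are deliberately not matched.
PAUSE_HEAP = re.compile(r"\[\s+(\d+)M->(\d+)M\((\d+)M\)\]")


def reclaimed_per_pause(log_text):
    """Return the MB freed by each incremental pause, in log order."""
    return [int(before) - int(after)
            for before, after, _cap in PAUSE_HEAP.findall(log_text)]


def trend(reclaimed):
    """Ratio of the mean of the later half to the mean of the earlier half.

    A value well below 1.0 means later pauses free less than earlier ones.
    Returns None when there are too few samples or nothing was freed early.
    """
    half = len(reclaimed) // 2
    if half == 0:
        return None
    first = sum(reclaimed[:half]) / half
    second = sum(reclaimed[half:]) / (len(reclaimed) - half)
    return second / first if first else None


# The single pause quoted in this thread.
sample = "   [ 7677M->7636M(8000M)]"
print(reclaimed_per_pause(sample))  # [41]
```

To run it over a whole log, read the file (for example the gc-hbase.log Todd greps) into a string, split it at each Full GC entry, and call trend on the pauses in between.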
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
___________________________________________________
|
# 5

07-07-2010 07:28 PM
|
|
|
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in yr option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If i am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the young gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
May be then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full gc (which
as i understand you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu <
> > wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
>
> On 07/06/10 13:27, Todd Lipcon wrote:
>> Hi all,
>>
>> I work on HBase, a distributed database written in Java. We
>> generally run on large heaps (8GB+), and our object lifetime
>> distribution has proven pretty problematic for garbage collection
>> (we manage a multi-GB LRU cache inside the process, so in CMS we
>> tenure a lot of byte arrays which later get collected).
>>
>> In Java6, we generally run with the CMS collector, tuning down
>> CMSInitiatingOccupancyFraction and constraining MaxNewSize to
>> achieve fairly low pause GC, but after a week or two of uptime we
>> often run into full heap compaction which takes several minutes
>> and wreaks havoc on the system.
>>
>> Needless to say, we've been watching the development of the G1 GC
>> with anticipation for the last year or two. Finally it seems in
>> the latest build of JDK7 it's stable enough for actual use
>> (previously it would segfault within an hour or so). However, in
>> my testing I'm seeing a fair amount of 8-10 second Full GC pauses.
>>
>> The flags I'm passing are:
>>
>> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
>> -XX:GCPauseIntervalMillis=80
>>
>> Most of the pauses I see in the GC log are around 10-20ms as expected:
>>
>> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
>> 0.01209000 secs]
>> [Parallel Time: 10.5 ms]
>> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
>> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
>> 1680080.1 1680080.0 1680079.9 1680081.5]
>> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2
>> 1.9 2.5 1.7 1.7 0.1
>> Avg: 1.8, Min: 0.1, Max: 2.5]
>> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>> Sum: 52, Avg: 4, Min: 1, Max: 8]
>> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4
>> 0.5 0.4 0.5 0.3 0.3 0.0
>> Avg: 0.4, Min: 0.0, Max: 0.7]
>> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6
>> 0.0 0.8 0.8 0.9
>> Avg: 0.5, Min: 0.0, Max: 0.9]
>> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2
>> 7.1 6.9 7.1 7.1 7.0
>> Avg: 7.1, Min: 6.9, Max: 7.3]
>> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Other: 0.6 ms]
>> [Clear CT: 0.5 ms]
>> [Other: 1.1 ms]
>> [Choose CSet: 0.0 ms]
>> [ 7677M->7636M(8000M)]
>> [Times: user=0.12 sys=0.00, real=0.01 secs]
>>
>> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC
>> 7934M->4865M(8000M), 9.8907800 secs]
>> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC
>> 7930M->4964M(8000M), 9.9025520 secs]
>> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC
>> 7934M->4882M(8000M), 10.1232190 secs]
>> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC
>> 7938M->5002M(8000M), 10.4997760 secs]
>> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC
>> 7938M->4962M(8000M), 11.0497380 secs]
>>
>> These pauses are pretty unacceptable for soft real time operation.
>>
>> Am I missing some tuning that should be done for G1GC for
>> applications like this? Is 20ms out of 80ms too aggressive a
>> target for the garbage rates we're generating?
>>
>> My actual live heap usage should be very stable around 5GB - the
>> application very carefully accounts its live object set at around
>> 60% of the max heap (as you can see in the logs above).
>>
>> At this point we are considering doing crazy things like ripping
>> out our main memory consumers into a custom slab allocator, and
>> manually reference count the byte array slices. But, if we can get
>> G1GC to work for us, it will save a lot of engineering on the
>> application side!
>>
>> Thanks
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>> ------------------------------------------------------------------------
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I assume you don't have an easily shared test case that would let us
reproduce both the CMS fragmentation and the G1 full gc issues
locally, for quickest progress on this?
-- ramki
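The fragmentation reading can be checked with a little arithmetic on the quoted log line: if the old gen had far more free space than the young gen could possibly promote, yet promotion still failed, the free space was presumably not contiguous enough. A rough sketch follows; the log text is copied from the excerpt above with the interleaved preclean message left out, and the assignment of the four transitions to young gen, old gen, whole heap and perm gen is an assumption based on their order.

```python
import re

# Heap transitions in the "before->after(capacity)" form used by the log line.
TRANSITION = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")

log = (
    "[ParNew (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]"
    " (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220 secs]"
    " 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)]"
)

# Assumed order of the four transitions: young gen, old gen, whole heap, perm gen.
young, old, heap, perm = [
    tuple(map(int, match.groups())) for match in TRANSITION.finditer(log)
]

old_free_kb = old[2] - old[0]   # old gen capacity minus occupancy before the full GC
young_capacity_kb = young[2]    # upper bound on what one scavenge could promote

print(f"old gen free before collection: {old_free_kb} KB "
      f"({old_free_kb / 1024**2:.2f} GB)")
print(f"young gen capacity: {young_capacity_kb} KB "
      f"({young_capacity_kb / 1024:.1f} MB)")
print(f"free old gen space is about {old_free_kb / young_capacity_kb:.0f}x the young gen")
```

The old gen had close to 2 GB free against a young gen of under 64 MB, which is consistent with the "about 2G free" figure and with fragmentation rather than a genuinely full heap.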
_______________________________________________
___________________________________________________
|
# 6

08-07-2010 01:26 AM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
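The "heap got too full" reading can be checked directly from the Full GC lines Todd posted. The sketch below is not a general GC log parser: it assumes each Full GC entry sits on a single line in the format quoted in this thread, and the helper name `full_gc_summary` is hypothetical. It reports occupancy before and after each full collection and the time since the previous one.

```python
import re

# uptime, heap before, heap after, heap capacity, pause time
FULL_GC = re.compile(
    r"(\d+\.\d+): \[Full GC (\d+)M->(\d+)M\((\d+)M\), (\d+\.\d+) secs\]"
)

# The five Full GC entries quoted in the original post, unwrapped.
log = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""


def full_gc_summary(text):
    """Summarize each Full GC: occupancy before/after, pause, gap since the last one."""
    rows = []
    previous_uptime = None
    for match in FULL_GC.finditer(text):
        uptime, pause = float(match.group(1)), float(match.group(5))
        before, after, capacity = (int(match.group(i)) for i in (2, 3, 4))
        rows.append({
            "uptime_s": uptime,
            "before_pct": 100.0 * before / capacity,
            "after_pct": 100.0 * after / capacity,
            "pause_s": pause,
            "interval_s": None if previous_uptime is None else uptime - previous_uptime,
        })
        previous_uptime = uptime
    return rows


for row in full_gc_summary(log):
    gap = "n/a" if row["interval_s"] is None else f"{row['interval_s']:.0f}s"
    print(f"before {row['before_pct']:.1f}%  after {row['after_pct']:.1f}%  "
          f"pause {row['pause_s']:.1f}s  since previous {gap}")
```

Each full collection starts with the heap above 99% occupied and ends near the 60% live set the application targets, which fits the explanation that incremental collection is falling behind.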
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run into
> full heap compaction which takes several minutes and wreaks havoc on the
> system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a fair
> amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000
> secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5
> 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60% of
> the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out our
> main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
I also work with Todd on this system (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here... The thing is the more we can get out of our
ram for this block cache, the better performing our systems are. Also
a lot of the settings are self tuning, so if we up the Xmx the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
>
> On 07/06/10 13:27, Todd Lipcon wrote:
>> Hi all,
>>
>> I work on HBase, a distributed database written in Java. We generally
>> run on large heaps (8GB+), and our object lifetime distribution has
>> proven pretty problematic for garbage collection (we manage a multi-GB
>> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
>> which later get collected).
>>
>> In Java6, we generally run with the CMS collector, tuning down
>> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
>> fairly low pause GC, but after a week or two of uptime we often run into
>> full heap compaction which takes several minutes and wreaks havoc on the
>> system.
>>
>> Needless to say, we've been watching the development of the G1 GC with
>> anticipation for the last year or two. Finally it seems in the latest
>> build of JDK7 it's stable enough for actual use (previously it would
>> segfault within an hour or so). However, in my testing I'm seeing a fair
>> amount of 8-10 second Full GC pauses.
>>
>> The flags I'm passing are:
>>
>> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
>>
>> Most of the pauses I see in the GC log are around 10-20ms as expected:
>>
>> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000
>> secs]
>> [Parallel Time: 10.5 ms]
>> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
>> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
>> 1680080.1 1680080.0 1680079.9 1680081.5]
>> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5
>> 1.7 1.7 0.1
>> Avg: 1.8, Min: 0.1, Max: 2.5]
>> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>> Sum: 52, Avg: 4, Min: 1, Max: 8]
>> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
>> 0.4 0.5 0.3 0.3 0.0
>> Avg: 0.4, Min: 0.0, Max: 0.7]
>> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
>> 0.8 0.8 0.9
>> Avg: 0.5, Min: 0.0, Max: 0.9]
>> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
>> 6.9 7.1 7.1 7.0
>> Avg: 7.1, Min: 6.9, Max: 7.3]
>> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Other: 0.6 ms]
>> [Clear CT: 0.5 ms]
>> [Other: 1.1 ms]
>> [Choose CSet: 0.0 ms]
>> [ 7677M->7636M(8000M)]
>> [Times: user=0.12 sys=0.00, real=0.01 secs]
>>
>> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
>> 9.8907800 secs]
>> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
>> 9.9025520 secs]
>> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
>> 10.1232190 secs]
>> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
>> 10.4997760 secs]
>> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
>> 11.0497380 secs]
>>
>> These pauses are pretty unacceptable for soft real time operation.
>>
>> Am I missing some tuning that should be done for G1GC for applications
>> like this? Is 20ms out of 80ms too aggressive a target for the garbage
>> rates we're generating?
>>
>> My actual live heap usage should be very stable around 5GB - the
>> application very carefully accounts its live object set at around 60% of
>> the max heap (as you can see in the logs above).
>>
>> At this point we are considering doing crazy things like ripping out our
>> main memory consumers into a custom slab allocator, and manually
>> reference count the byte array slices. But, if we can get G1GC to work
>> for us, it will save a lot of engineering on the application side!
>>
>> Thanks
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>
>> ------------------------------------------------------------------------
___________________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much for slop,
i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of yr leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
|
# 7

08-07-2010 05:46 PM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
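
The five Full GC lines quoted above can be summarized with a short
script. This is only a sketch: the regular expression is written against
the log format as it appears in this message (with the mail client's
line wraps undone), and the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass

# Full GC lines from the quoted message, one event per line.
FULL_GC_LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

FULL_GC_RE = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<capacity>\d+)M\), (?P<pause>\d+\.\d+) secs\]"
)


@dataclass
class FullGc:
    uptime_s: float    # JVM uptime at the start of the collection
    before_mb: int     # heap occupancy before the collection
    after_mb: int      # heap occupancy after the collection
    capacity_mb: int   # heap capacity
    pause_s: float     # pause duration

    @property
    def reclaimed_mb(self) -> int:
        return self.before_mb - self.after_mb

    @property
    def live_fraction(self) -> float:
        return self.after_mb / self.capacity_mb


def parse_full_gcs(text: str) -> list:
    """Extract every Full GC event found in the text."""
    return [
        FullGc(float(m["uptime"]), int(m["before"]), int(m["after"]),
               int(m["capacity"]), float(m["pause"]))
        for m in FULL_GC_RE.finditer(text)
    ]


def intervals_s(events: list) -> list:
    """Seconds of uptime between consecutive Full GCs."""
    return [b.uptime_s - a.uptime_s for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    events = parse_full_gcs(FULL_GC_LOG)
    for e in events:
        print(f"{e.uptime_s:.3f}s reclaimed={e.reclaimed_mb}M "
              f"live={e.live_fraction:.1%} pause={e.pause_s:.2f}s")
    print([round(i) for i in intervals_s(events)])
```

The post-collection occupancy is the figure to compare against the
"around 60% of the max heap" live set that the application accounts for.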
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here... The thing is the more we can get out of our
ram for this block cache, the better performing our systems are. Also
a lot of the settings are self tuning, so if we up the Xmx the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, if you have comparable CMS logs,
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full gc (which
as I understand you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
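
The reasoning quoted above (a CMSInitiatingOccupancyFraction of about 75
against roughly 60% heap usage) can be checked with a little arithmetic.
This is a sketch under stated assumptions: the 8000 MB heap and 64 MB
young gen are taken from flags quoted in this thread, the whole live set
is assumed to sit in the old gen, and the occupancy fraction is treated
as a percentage of old gen capacity. The function name is hypothetical.

```python
def cms_trigger_headroom(heap_mb: int, young_mb: int,
                         occupancy_percent: int, live_percent: int):
    """Return (trigger_mb, live_mb, headroom_mb).

    trigger_mb  : old gen occupancy at which a CMS cycle would start
    live_mb     : live set the application maintains (percent of the whole heap)
    headroom_mb : garbage that can accumulate in the old gen before a cycle starts
    """
    old_mb = heap_mb - young_mb
    trigger_mb = old_mb * occupancy_percent / 100
    live_mb = heap_mb * live_percent / 100
    return trigger_mb, live_mb, trigger_mb - live_mb


if __name__ == "__main__":
    trigger, live, headroom = cms_trigger_headroom(
        heap_mb=8000, young_mb=64, occupancy_percent=75, live_percent=60)
    print(trigger, live, headroom)
```

A negative headroom would mean the threshold sits below the steady-state
live set, in which case concurrent cycles would be expected to run back
to back. Note that, as far as I know, HotSpot treats the fraction as a
hard threshold only when -XX:+UseCMSInitiatingOccupancyOnly is also set.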
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
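
As a rough check of the fragmentation reading, the numbers in the quoted
log line can be compared directly: how much of the old gen was free when
the promotion failed, against the size of the whole young gen. This is a
sketch only. The log text below is the quoted entry with the interleaved
CMS-concurrent-preclean output left out, and the helper name is
hypothetical.

```python
import re

# The quoted entry, unwrapped, without the interleaved concurrent-preclean text.
CMS_FAILURE = (
    "[GC 28490.203: [ParNew (promotion failed): 59008K->59008K(59008K), "
    "0.0179250 secs]28490.221: [CMS (concurrent mode failure): "
    "6359176K->4206871K(8323072K), 17.4366220 secs] "
    "6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)], "
    "17.4546890 secs]"
)


def occupancy_kb(label: str, text: str):
    """Return (before_kb, after_kb, capacity_kb) for the section tagged '(label)'."""
    m = re.search(re.escape(label) + r"\): (\d+)K->(\d+)K\((\d+)K\)", text)
    if m is None:
        raise ValueError(f"no '{label}' section found")
    return tuple(int(g) for g in m.groups())


young = occupancy_kb("promotion failed", CMS_FAILURE)
old = occupancy_kb("concurrent mode failure", CMS_FAILURE)

# Free space in the old gen at the moment the promotion failed.
old_free_kb = old[2] - old[0]

if __name__ == "__main__":
    print(f"old gen free: {old_free_kb} K ({old_free_kb / 1024 / 1024:.2f} GB)")
    print(f"young gen occupancy: {young[0]} K")
    print(f"free space is {old_free_kb / young[0]:.0f}x the young gen")
```

Free space many times larger than the entire young gen at the time of a
promotion failure is consistent with fragmentation, though it does not
prove it on its own.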
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full gc
issues locally for quickest progress on this?
-- ramki
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
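
For readers unfamiliar with the idea being set aside here, the
"ref-counted slices of a single pre-allocated byte array" can be
sketched roughly as follows. This is a toy illustration, not HBase code:
the class and method names are hypothetical, all slots have one fixed
size, and there is no thread safety.

```python
class SlabAllocator:
    """Fixed-size, reference-counted slots carved out of one pre-allocated buffer."""

    def __init__(self, slot_size: int, slot_count: int):
        self._slot_size = slot_size
        self._buf = bytearray(slot_size * slot_count)  # allocated once, never resized
        self._refs = [0] * slot_count                   # reference count per slot
        # Stack of free slot indices; slot 0 is handed out first.
        self._free = list(range(slot_count - 1, -1, -1))

    def allocate(self) -> int:
        """Reserve a free slot with a reference count of one and return its index."""
        if not self._free:
            raise MemoryError("slab exhausted")
        slot = self._free.pop()
        self._refs[slot] = 1
        return slot

    def retain(self, slot: int) -> None:
        """Add a reference to a slot that is already allocated."""
        if self._refs[slot] == 0:
            raise ValueError("slot is not allocated")
        self._refs[slot] += 1

    def release(self, slot: int) -> None:
        """Drop a reference; the slot returns to the free list at zero."""
        if self._refs[slot] == 0:
            raise ValueError("slot is not allocated")
        self._refs[slot] -= 1
        if self._refs[slot] == 0:
            self._free.append(slot)

    def view(self, slot: int) -> memoryview:
        """Writable window onto the slot's bytes, without copying."""
        start = slot * self._slot_size
        return memoryview(self._buf)[start:start + self._slot_size]

    def free_slots(self) -> int:
        return len(self._free)


if __name__ == "__main__":
    slab = SlabAllocator(slot_size=4, slot_count=2)
    s = slab.allocate()
    slab.view(s)[:] = b"abcd"
    print(bytes(slab.view(s)), slab.free_slots())
    slab.release(s)
    print(slab.free_slots())
```

Because every slot has the same size and the backing buffer is allocated
once, the collector never sees the individual cache entries come and go,
which is the property the proposal relies on to avoid fragmentation.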
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
|
# 8

12-07-2010 05:02 PM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of you application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
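To put numbers on the Full GC entries quoted above, here is a minimal Python sketch (not part of the original post). It assumes the entries have been rejoined onto single lines, since the mail client wrapped them, and that the format is exactly the one shown. The regular expression and the name `parse_full_gcs` are illustrative choices, not a tool mentioned in the thread.

```python
import re

# The five Full GC entries quoted above, with the mail line wrapping undone.
LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

# Matches "<uptime>: [Full GC <before>M-><after>M(<total>M), <secs> secs]".
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<total>\d+)M\), (?P<secs>\d+\.\d+) secs\]"
)

def parse_full_gcs(text):
    """Return one dict per Full GC entry found in text."""
    return [
        {
            "uptime": float(m.group("uptime")),
            "before_mb": int(m.group("before")),
            "after_mb": int(m.group("after")),
            "total_mb": int(m.group("total")),
            "pause_s": float(m.group("secs")),
        }
        for m in FULL_GC.finditer(text)
    ]

records = parse_full_gcs(LOG)
# Seconds of JVM uptime between consecutive full collections.
intervals = [b["uptime"] - a["uptime"] for a, b in zip(records, records[1:])]

for r in records:
    reclaimed = r["before_mb"] - r["after_mb"]
    live_pct = 100.0 * r["after_mb"] / r["total_mb"]
    print(f'{r["uptime"]:.3f}s pause={r["pause_s"]:.1f}s '
          f'reclaimed={reclaimed}M live={live_pct:.1f}%')
print("seconds between full GCs:", [round(i) for i in intervals])
```

On these five entries the gaps between full collections come out at roughly 5 to 9 minutes, and occupancy after each collection is about 61% to 63% of the 8000M heap, in line with the ~60% live set described in the post.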
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some reasonable amount of
memory efficiency here. The thing is, the more we can get out of our
RAM for this block cache, the better our systems perform. Also,
a lot of the settings are self-tuning, so if we up the Xmx, the size of
the block cache is scaled up as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in yr option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand, you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
>> Todd,
>>
>> Could you send a segment of the GC logs from the beginning
>> through the first dozen or so full GC's?
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>> Exactly which version of the JVM are you using?
>>
>> java -version
>>
>> will tell us.
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>> Do you have a test setup where you could do some experiments?
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>> Can you send the set of CMS flags you use? It might tell
>> us something about the GC behavior of your application.
>> Might not tell us anything but it's worth a look.
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
> Another uses:
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not include the
portion leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
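To put numbers on the fragmentation argument, a small Python sketch (not from the thread) can work out the old gen headroom from the promotion-failure entry quoted above. The figures are copied from that log line; the variable names are hypothetical.

```python
# Figures copied from the ParNew/CMS entry quoted above (all in KB).
young_capacity_kb = 59008        # ParNew: 59008K->59008K(59008K)
old_used_before_kb = 6359176     # CMS: 6359176K->4206871K(8323072K)
old_used_after_kb = 4206871
old_capacity_kb = 8323072

# Free space in the old gen at the moment promotion failed.
old_free_before_kb = old_capacity_kb - old_used_before_kb
free_gib = old_free_before_kb / (1024 * 1024)
ratio = old_free_before_kb / young_capacity_kb

print(f"old gen free before the failure: {old_free_before_kb} KB (~{free_gib:.2f} GiB)")
print(f"that is ~{ratio:.0f}x the whole young gen")
print(f"old gen occupancy after compaction: {old_used_after_kb / old_capacity_kb:.0%}")
```

The old gen had roughly 1.9 GiB free, over thirty times the size of the young gen, yet promotion still failed. That is consistent with the free space being split into chunks too small for the objects being promoted, rather than with the heap simply being full.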
I assume you don't have an easily shared test case that would let us
reproduce both the CMS fragmentation and the G1 full GC
issues locally, for quickest progress on this?
-- ramki
_______________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much for slop,
i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
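As a rough worked example of that sizing rule (a sketch, not a recommendation from the thread): take the measured stable old gen occupancy, allow the same amount again as slop, and add the young gen. The 4.2 GB figure is the one mentioned above; the 64 MB young gen is an assumption borrowed from the CMS flags posted earlier in the thread (-XX:NewSize=64m), and `suggested_heap_mb` is a hypothetical helper.

```python
def suggested_heap_mb(stable_old_mb, young_mb, slop_factor=2.0):
    """Old gen sized at slop_factor x the stable occupancy, plus the young gen."""
    old_gen_mb = stable_old_mb * slop_factor
    return old_gen_mb, old_gen_mb + young_mb

stable_old_mb = 4.2 * 1024   # stable old gen occupancy from the message above
young_mb = 64                # assumed: matches -XX:NewSize=64m / -XX:MaxNewSize=64m

old_gen_mb, heap_mb = suggested_heap_mb(stable_old_mb, young_mb)
print(f"old gen ~{old_gen_mb / 1024:.1f} GB, total heap ~{heap_mb:.0f} MB")
print(f"steady-state occupancy ~{stable_old_mb / heap_mb:.0%} of the heap")
```

With these inputs the old gen comes to 8.4 GB and the live data sits at about half of the heap, which is the space overhead one would then try to iterate downwards.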
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of yr leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition, I was recently seeing fallbacks to full GCs in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
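The two cases above can be illustrated with a deliberately simplified toy model in Python. This is not G1's actual collection-set selection code: the greedy garbage-per-millisecond heuristic, the `Region` fields and all the numbers are assumptions made up for illustration. Each region has an estimated evacuation cost (rset scanning plus copying) and an amount of reclaimable garbage, and regions are picked best ratio first until the pause budget is spent.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    garbage_mb: float      # space reclaimed if the region is evacuated
    rset_scan_ms: float    # estimated remembered-set scanning cost
    copy_ms: float         # estimated cost of copying the live data

    @property
    def cost_ms(self):
        return self.rset_scan_ms + self.copy_ms

def choose_cset(regions, budget_ms):
    """Greedy toy heuristic: best garbage/cost ratio first, within the budget."""
    chosen, spent = [], 0.0
    for r in sorted(regions, key=lambda r: r.garbage_mb / r.cost_ms, reverse=True):
        if spent + r.cost_ms <= budget_ms:
            chosen.append(r.name)
            spent += r.cost_ms
    return chosen

regions = [
    Region("A", garbage_mb=0.9, rset_scan_ms=1.0, copy_ms=1.0),
    Region("B", garbage_mb=0.8, rset_scan_ms=2.0, copy_ms=1.0),
    Region("C", garbage_mb=0.7, rset_scan_ms=3.0, copy_ms=2.0),
    # Almost entirely garbage, but with a very expensive remembered set.
    Region("hot", garbage_mb=0.95, rset_scan_ms=30.0, copy_ms=0.5),
]

print(choose_cset(regions, budget_ms=10))  # case (1): "hot" alone exceeds the budget
print(choose_cset(regions, budget_ms=32))  # case (2): "hot" would fit alone, but is crowded out
print(choose_cset(regions, budget_ms=45))  # a larger budget finally admits "hot"
```

In this toy setup the "hot" region is skipped at both the 10 ms and 32 ms budgets, so its garbage would only be reclaimed by a full collection, and it is picked only once the budget is raised to 45 ms. Real region contents and cost estimates change between pauses, which the sketch ignores.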
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
G1 paper mentions an rset scanning thread that I suspect is
intended to do rset scanning in the background, so that regions
like these could be evicted more cheaply during the STW eviction
pause; I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
_______________________________________________
|
# 9

12-07-2010 08:43 PM
|
|
|
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs almost look like the heap got too full,
which would mean that incremental collection is not keeping up with
the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
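Whether collection is keeping up can be checked directly from the log. The following is a rough Python sketch, not part of the original thread, that parses single-line Full GC records of the form shown in Todd's grep output and estimates how fast the heap refills between full collections. The function names and the regular expression are illustrative assumptions; the sample lines are three of Todd's records, unwrapped to one line each.

```python
import re

# Matches a single-line G1 "Full GC" record like the ones in Todd's grep output.
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<heap>\d+)M\), (?P<pause>\d+\.\d+) secs\]"
)


def parse_full_gcs(lines):
    """Return (uptime_s, before_mb, after_mb, heap_mb, pause_s) per Full GC line."""
    events = []
    for line in lines:
        m = FULL_GC.search(line)
        if m:
            events.append((float(m["uptime"]), int(m["before"]), int(m["after"]),
                           int(m["heap"]), float(m["pause"])))
    return events


def refill_rates(events):
    """MB/s at which occupancy grew between consecutive full collections."""
    rates = []
    for prev, cur in zip(events, events[1:]):
        elapsed = cur[0] - prev[0]   # wall-clock seconds, includes the previous pause
        grown = cur[1] - prev[2]     # occupancy before this GC minus after the last one
        rates.append(grown / elapsed)
    return rates


# Three records from the thread, each joined onto one line.
SAMPLE = [
    "2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]",
    "2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]",
    "2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]",
]

if __name__ == "__main__":
    events = parse_full_gcs(SAMPLE)
    for uptime, before, after, heap, pause in events:
        print(f"{uptime:10.1f}s  {before / heap:.0%} full -> {after / heap:.0%}, "
              f"paused {pause:.1f}s")
    print([round(r, 1) for r in refill_rates(events)])
```

On these three records the occupancy climbs back from roughly 4.9 GB to over 7.9 GB within five to seven minutes. The elapsed time includes the full GC pause itself, so the true mutator fill rate is somewhat higher than the figure computed here.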
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some reasonable amount of
memory efficiency here. The more we can get out of our RAM for this
block cache, the better our systems perform. Also, a lot of the
settings are self-tuning, so if we up the Xmx, the size of the block
cache is scaled up as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) Do you enable class unloading with CMS? (I do not see that
in your option list below, but wondered.)
(2) Does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand, you get to more quickly with G1 than you did
with CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not include the
portion leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full GC
issues locally for quickest progress on this?
-- ramki
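The arithmetic behind Todd's reading of that log record can be sketched as follows. This is an illustrative Python snippet, not something posted in the thread; `free_old_gen_kb` is a hypothetical helper, and the record is the one quoted above joined onto a single line with the interleaved CMS-concurrent-preclean output left out.

```python
import re

# "before->after(capacity)" occupancy triples, in KB, as they appear in the record.
OCCUPANCY = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")

# The record from the thread, on one line, without the interleaved preclean message.
RECORD = (
    "[GC 28490.203: [ParNew (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]"
    "28490.221: [CMS (concurrent mode failure): 6359176K->4206871K(8323072K), "
    "17.4366220 secs] 6417373K->4206871K(8382080K), "
    "[CMS Perm : 18609K->18565K(31048K)], 17.4546890 secs]"
)


def free_old_gen_kb(record):
    """Return (free old-gen KB when promotion failed, young-gen capacity KB)."""
    triples = [tuple(map(int, t)) for t in OCCUPANCY.findall(record)]
    young, old = triples[0], triples[1]   # ParNew triple comes first, then the CMS old gen
    old_before, _old_after, old_capacity = old
    return old_capacity - old_before, young[2]


if __name__ == "__main__":
    free_kb, young_kb = free_old_gen_kb(RECORD)
    print(f"old gen free: {free_kb / 1024 / 1024:.2f} GB, "
          f"young gen capacity: {young_kb / 1024:.1f} MB")
```

This gives about 1.87 GB free in the old gen against a young gen of under 60 MB, so the promotion did not fail for lack of total space. That is consistent with fragmentation, though the log line alone does not prove it.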
_______________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for slop,
i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of your leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
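The sizing rule ramki describes, together with Todd's 40%/20% cache split, works out as below. This is a minimal Python sketch with hypothetical function names; the 64 MB young gen is taken from the CMS flags quoted earlier in the thread and is only an assumption for this particular run.

```python
def cache_footprint_gb(heap_gb, memstore_fraction, block_cache_fraction):
    """Long-lived data the application expects to hold, per Todd's description."""
    return heap_gb * (memstore_fraction + block_cache_fraction)


def suggested_heap_gb(stable_old_gb, young_gb, slop_factor=2.0):
    """ramki's rule of thumb: old gen = slop_factor x stable occupancy, plus young gen."""
    return stable_old_gb * slop_factor + young_gb


if __name__ == "__main__":
    # Todd's current split: 40% Memstore + 20% LRU cache on an 8 GB heap.
    print(f"40/20 split on 8 GB: {cache_footprint_gb(8, 0.40, 0.20):.1f} GB long-lived")
    # The reduced split Todd offers to test.
    print(f"20/20 split on 8 GB: {cache_footprint_gb(8, 0.20, 0.20):.1f} GB long-lived")
    # ramki's example: old gen stabilizing at 4.2 GB, assumed 64 MB young gen.
    print(f"suggested -Xmx: about {suggested_heap_gb(4.2, 64 / 1024):.2f} GB")
```

The gap between the two views is visible in the numbers: the rule of thumb asks for roughly half the heap as headroom, while the application is designed to keep about 60% of it live.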
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition, I was recently seeing fallbacks to full GCs in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions with higher pay-off/cost ratios that get selected instead, such
that the expensive regions end up never being collected even if
theoretically a single such region could be evicted within the pause
time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect is
intended to help do rset scanning in the background such that regions
like these could be evicted more cheaply during the STW eviction
pause; I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
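Peter's two cases can be illustrated with a toy model of collection set selection. The Python sketch below is a deliberate simplification and not HotSpot's actual policy code: regions are ranked by reclaimable space per millisecond of predicted cost and added greedily while the predicted pause fits a budget. All region names, costs, and the 8 ms budget are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    reclaimable_mb: float   # garbage that evacuating the region would free
    copy_ms: float          # predicted cost of copying its live objects
    rset_scan_ms: float     # predicted cost of scanning its remembered set

    @property
    def cost_ms(self):
        return self.copy_ms + self.rset_scan_ms

    @property
    def efficiency(self):
        return self.reclaimable_mb / self.cost_ms


def choose_collection_set(regions, budget_ms):
    """Greedy pick: best reclaimable-per-ms first, while the predicted pause fits."""
    chosen, spent = [], 0.0
    for region in sorted(regions, key=lambda r: r.efficiency, reverse=True):
        if spent + region.cost_ms <= budget_ms:
            chosen.append(region.name)
            spent += region.cost_ms
    return chosen


# Hypothetical regions: the last two are almost entirely garbage but are
# referenced from many other regions, so their remembered sets are large.
REGIONS = [
    Region("cheap-1", reclaimable_mb=0.6, copy_ms=1.0, rset_scan_ms=0.5),
    Region("cheap-2", reclaimable_mb=0.5, copy_ms=1.0, rset_scan_ms=0.5),
    Region("cheap-3", reclaimable_mb=0.4, copy_ms=1.5, rset_scan_ms=0.5),
    Region("hot-rset", reclaimable_mb=0.95, copy_ms=0.1, rset_scan_ms=6.0),
    Region("huge-rset", reclaimable_mb=0.95, copy_ms=0.1, rset_scan_ms=25.0),
]

if __name__ == "__main__":
    print(choose_collection_set(REGIONS, budget_ms=8.0))
    print(choose_collection_set(REGIONS, budget_ms=40.0))
```

With the 8 ms budget only the three cheap regions are chosen: `huge-rset` can never fit (case 1), and `hot-rset` would fit on its own but is always outranked (case 2). Raising the budget to 40 ms lets both in, which matches Peter's observation that a larger pause goal mitigates the effect without fixing it.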
_______________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you describe) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
_______________________________________________
|
# 10

13-07-2010 07:43 PM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs look almost as if the heap got too full,
which would mean that incremental collection is not keeping up with
the rate of garbage generation.
Also, which JDK build are you running?
-- ramki
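A rough way to check the "heap got too full" reading against the Full GC
lines Todd posted: the sketch below parses those lines and estimates how
fast heap occupancy climbs between the end of one full GC and the start of
the next. The class and method names are made up for illustration, the
sample input is the grep output from the original post with the email
quote markers removed, and it assumes the uptime stamp marks the start of
each pause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FullGcSummary {
    /** One parsed "Full GC" entry. Class and field names are illustrative. */
    public static final class FullGc {
        public final double uptimeSec; // JVM uptime stamp (assumed: start of the pause)
        public final int beforeMb;     // heap occupancy before the collection
        public final int afterMb;      // heap occupancy after the collection
        public final double pauseSec;  // reported length of the pause

        public FullGc(double uptimeSec, int beforeMb, int afterMb, double pauseSec) {
            this.uptimeSec = uptimeSec;
            this.beforeMb = beforeMb;
            this.afterMb = afterMb;
            this.pauseSec = pauseSec;
        }
    }

    // Matches e.g. "52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]".
    // The \s+ after the comma tolerates the line wrapping introduced by email.
    private static final Pattern FULL_GC = Pattern.compile(
        "(\\d+\\.\\d+): \\[Full GC (\\d+)M->(\\d+)M\\(\\d+M\\),\\s+(\\d+\\.\\d+) secs\\]");

    /** Extracts every Full GC entry found in the given log text. */
    public static List<FullGc> parse(String log) {
        List<FullGc> out = new ArrayList<>();
        Matcher m = FULL_GC.matcher(log);
        while (m.find()) {
            out.add(new FullGc(
                Double.parseDouble(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Double.parseDouble(m.group(4))));
        }
        return out;
    }

    /** Net occupancy growth in MB/s from the end of one full GC to the start of the next. */
    public static double netGrowthMbPerSec(FullGc prev, FullGc next) {
        double elapsedSec = next.uptimeSec - (prev.uptimeSec + prev.pauseSec);
        return (next.beforeMb - prev.afterMb) / elapsedSec;
    }

    public static void main(String[] args) {
        String log =
            "2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),\n"
          + " 9.8907800 secs]\n"
          + "2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),\n"
          + " 9.9025520 secs]\n"
          + "2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),\n"
          + " 10.1232190 secs]\n";
        List<FullGc> gcs = parse(log);
        for (int i = 1; i < gcs.size(); i++) {
            System.out.printf("full GC at %.0f s: %.1f MB/s net growth since previous%n",
                gcs.get(i).uptimeSec, netGrowthMbPerSec(gcs.get(i - 1), gcs.get(i)));
        }
    }
}
```

For the first two entries this works out to roughly 10.6 MB/s of net
growth. That figure is only what the incremental collections failed to
reclaim, not the allocation rate, so it says how quickly the heap fills
but not why.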
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some reasonable amount of
memory efficiency here. The more we can get out of our RAM for this
block cache, the better our systems perform. Also, a lot of the
settings are self-tuning, so if we raise the Xmx, the size of
the block cache is scaled up as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
>
___________________________________________________
Two questions:
(1) Do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) Does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand it, you get to more quickly with G1 than you did
with CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
>
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not include the
tail leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, -XX:+PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full GC
issues locally for quickest progress on this?
-- ramki
_______________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for
slop, i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen; as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
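The sizing rule above is simple arithmetic; a minimal sketch, assuming the
4.2 GB figure is rounded to 4200 MB and the young gen stays at the 64 MB
used in the CMS flags quoted earlier in the thread. The class and method
names are invented for illustration.

```java
public class HeapSizing {
    /**
     * Old gen sized at slopFactor times the measured stable occupancy,
     * plus the young gen, gives the suggested total heap in MB.
     */
    public static long suggestedHeapMb(long stableOldGenMb, double slopFactor, long youngGenMb) {
        return Math.round(stableOldGenMb * slopFactor) + youngGenMb;
    }

    public static void main(String[] args) {
        long stableOldGenMb = 4200; // assumption: "4.2 GB" taken as 4200 MB
        long youngGenMb = 64;       // NewSize=MaxNewSize=64m from the quoted CMS flags
        long heapMb = suggestedHeapMb(stableOldGenMb, 2.0, youngGenMb);
        // Print the result as the heap and young gen flags one might start the experiment with.
        System.out.println("-Xmx" + heapMb + "m -XX:NewSize=" + youngGenMb
            + "m -XX:MaxNewSize=" + youngGenMb + "m");
    }
}
```

With these inputs the old gen comes to 8400 MB and the whole heap to
8464 MB. Iterating downwards then amounts to lowering the slop factor from
2.0 until the fragmentation-induced full GCs reappear.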
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of your leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition, I was recently seeing fallbacks to full GCs in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
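To make the two cases concrete, here is a toy model of cost-based
collection set selection under a pause budget. It is NOT G1's actual
policy code: the greedy rule, the Region fields and every number in it
are assumptions chosen only to illustrate the effect described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsetToyModel {
    /** Hypothetical per-region estimate; names and units are illustrative, not G1's. */
    public static final class Region {
        public final String name;
        public final double reclaimableMb; // garbage that evacuating the region would free
        public final double rsetScanMs;    // estimated remembered set scanning cost
        public final double copyMs;        // estimated cost of copying the live objects

        public Region(String name, double reclaimableMb, double rsetScanMs, double copyMs) {
            this.name = name;
            this.reclaimableMb = reclaimableMb;
            this.rsetScanMs = rsetScanMs;
            this.copyMs = copyMs;
        }

        public double costMs() { return rsetScanMs + copyMs; }
        public double efficiency() { return reclaimableMb / costMs(); }
    }

    /** Greedily picks regions by reclaimable MB per ms until the pause budget is used up. */
    public static List<String> choose(List<Region> candidates, double pauseBudgetMs) {
        List<Region> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Double.compare(b.efficiency(), a.efficiency()));
        List<String> chosen = new ArrayList<>();
        double usedMs = 0.0;
        for (Region r : sorted) {
            if (usedMs + r.costMs() <= pauseBudgetMs) {
                chosen.add(r.name);
                usedMs += r.costMs();
            }
        }
        return chosen;
    }

    /** Made-up regions: D and E are mostly garbage but expensive to scan. */
    public static List<Region> sampleRegions() {
        return Arrays.asList(
            new Region("A", 30, 1, 2),     // cost 3 ms
            new Region("B", 25, 2, 4),     // cost 6 ms
            new Region("C", 20, 3, 6),     // cost 9 ms
            new Region("D", 28, 14, 1),    // cost 15 ms: fits alone, but gets crowded out
            new Region("E", 31, 40, 0.5)); // cost 40.5 ms: can never fit a 20 ms budget
    }

    public static void main(String[] args) {
        System.out.println("20 ms budget: " + choose(sampleRegions(), 20.0));
        System.out.println("40 ms budget: " + choose(sampleRegions(), 40.0));
    }
}
```

With a 20 ms budget the model picks A, B and C. D is case (2): it would
fit on its own but loses to cheaper regions every time. E is case (1): its
estimated cost alone exceeds the budget. Raising the budget to 40 ms picks
up D but still never E.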
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
G1 paper mentions an rset scanning thread that I suspect is
intended to do rset scanning in the background, so that regions
like these could be evicted more cheaply during the STW eviction
pause. I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
_______________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
_______________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue. Thanks,
Tony, HS GC Group
_______________________________________________
|
# 11

14-07-2010 01:15 AM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GCs?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
It might not tell us anything, but it's worth a look.
Jon
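The version and flag information asked for here can also be read from inside a running JVM. The following is only a minimal sketch: the class name JvmInfo and the helper gcRelatedFlags are made up for illustration, while System.getProperty and RuntimeMXBean.getInputArguments() are standard JDK calls. It uses streams, so it assumes Java 8 or later rather than the JDK 6/7 builds discussed in this thread.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public class JvmInfo {
    /** Keeps only -X..., -XX:... and -verbose:gc options (hypothetical helper). */
    public static List<String> gcRelatedFlags(List<String> args) {
        return args.stream()
                   .filter(a -> a.startsWith("-X") || a.startsWith("-verbose:gc"))
                   .collect(Collectors.toList());
    }

    public static void main(String[] argv) {
        // Roughly the same information that "java -version" prints.
        System.out.println("java.version    = " + System.getProperty("java.version"));
        System.out.println("java.vm.name    = " + System.getProperty("java.vm.name"));
        System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));

        // Options the JVM was actually started with (excludes main-method arguments).
        List<String> inputArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String flag : gcRelatedFlags(inputArgs)) {
            System.out.println(flag);
        }
    }
}
```

Started with the G1 flags from the original post, this would be expected to list -Xmx8000m and the -XX options; system properties passed with -D are filtered out.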
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
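To check the "around 60% of the max heap" figure against the Full GC records quoted above, a small parser along the following lines would do. This is a sketch under assumptions: the class name FullGcSummary and the regular expression are illustrative, the pattern only covers the G1 "Full GC" record shape shown in this thread, and the two sample lines are the first and last quoted records with the mail line-wrapping rejoined.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FullGcSummary {
    // Matches e.g. "[Full GC 7934M->4865M(8000M), 9.8907800 secs]"
    private static final Pattern FULL_GC = Pattern.compile(
        "\\[Full GC\\s+(\\d+)M->(\\d+)M\\((\\d+)M\\),\\s+([0-9.]+) secs\\]");

    /** Returns {beforeMB, afterMB, capacityMB, pauseSecs}, or null if the line is not a Full GC record. */
    public static double[] parse(String line) {
        Matcher m = FULL_GC.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3)),
            Double.parseDouble(m.group(4))
        };
    }

    public static void main(String[] args) {
        String[] log = {
            "2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]",
            "2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]"
        };
        for (String line : log) {
            double[] r = parse(line);
            if (r == null) {
                continue; // not a Full GC record
            }
            // Occupancy before and after as a share of capacity, plus reclaimed space and pause.
            System.out.printf("before=%.0f%% after=%.0f%% reclaimed=%.0fM pause=%.1fs%n",
                100 * r[0] / r[2], 100 * r[1] / r[2], r[0] - r[1], r[3]);
        }
    }
}
```

For the first record, occupancy after the collection is 4865M of 8000M, or about 61%, which agrees with the stated live set of roughly 60%.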
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs almost look like the heap got too full,
so it must mean that incremental collection is not keeping up with
the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here. The thing is, the more we can get out of our
RAM for this block cache, the better our systems perform. Also,
a lot of the settings are self-tuning, so if we up the Xmx, the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) Do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) Does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand, you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of you application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
>
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, -XX:+PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full GC
issues locally for quickest progress on this?
-- ramki
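The arithmetic behind the fragmentation reading of that log record can be spelled out as follows. This is a sketch only: the class and method names are illustrative, and the three numbers are copied from the quoted "promotion failed" / "concurrent mode failure" record (old gen 6359176K used of 8323072K, young gen capacity 59008K).

```java
public class PromotionFailureCheck {
    /** Free space in a generation, in kilobytes. */
    public static long freeKb(long capacityKb, long usedKb) {
        return capacityKb - usedKb;
    }

    public static void main(String[] args) {
        long oldUsedKb = 6_359_176L;      // CMS occupancy before the failure
        long oldCapacityKb = 8_323_072L;  // CMS generation capacity
        long youngCapacityKb = 59_008L;   // ParNew capacity

        long oldFreeKb = freeKb(oldCapacityKb, oldUsedKb);
        System.out.printf("old gen free: %d K (%.2f GB)%n",
            oldFreeKb, oldFreeKb / (1024.0 * 1024.0));
        // Even promoting the entire young gen would need only a small share of that.
        System.out.printf("young gen is %.1f%% of the free old gen space%n",
            100.0 * youngCapacityKb / oldFreeKb);
    }
}
```

Since the whole young gen is only a few percent of the free old gen space, a promotion failure here is at least consistent with the free space being split into chunks too small for the objects being promoted, rather than with the old gen simply being full.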
_______________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for slop,
i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
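The sizing rule described here (old gen = stable occupancy plus as much again for slop, then add the young gen) works out as in the sketch below. The class and method names are hypothetical, the 4300 MB figure is an assumed round-off of the 4.2 GB mentioned above, and the 64 MB young gen is taken from the -XX:NewSize=64m / -XX:MaxNewSize=64m settings quoted elsewhere in this thread.

```java
public class HeapSizing {
    /**
     * Old gen = stable occupancy times a slop factor; total heap = old gen + young gen.
     * All sizes are in megabytes.
     */
    public static long suggestedHeapMb(long stableOldGenMb, double slopFactor, long youngGenMb) {
        long oldGenMb = Math.round(stableOldGenMb * slopFactor);
        return oldGenMb + youngGenMb;
    }

    public static void main(String[] args) {
        long stableOldGenMb = 4300; // assumption: roughly 4.2 GB of stable old gen occupancy
        long youngGenMb = 64;       // from the CMS flags quoted in this thread
        long heapMb = suggestedHeapMb(stableOldGenMb, 2.0, youngGenMb);

        // Print the resulting settings as standard HotSpot options.
        System.out.println("-Xmx" + heapMb + "m -XX:NewSize=" + youngGenMb
            + "m -XX:MaxNewSize=" + youngGenMb + "m");
    }
}
```

With these inputs the suggested starting point is an 8664 MB heap; iterating downwards then means lowering the slop factor from 2.0 until full collections reappear.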
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of yr leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GC:s recently in
a slightly different test that I also posed about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
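Cases (1) and (2) can be illustrated with a toy model of collection-set selection. This is not HotSpot's actual policy code: the `Region` fields, the garbage-per-millisecond ordering, and the rule of stopping at the first region that does not fit are simplifying assumptions chosen to show the two effects. The region names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    garbage_kb: int     # space that would be reclaimed
    rs_scan_ms: float   # predicted remembered-set scanning cost
    copy_ms: float      # predicted cost of copying live objects

    @property
    def cost_ms(self) -> float:
        return self.rs_scan_ms + self.copy_ms


def choose_cset(regions, budget_ms):
    """Greedily pick regions by garbage-per-ms until one no longer fits."""
    chosen = []
    remaining = budget_ms
    ranked = sorted(regions, key=lambda r: r.garbage_kb / r.cost_ms, reverse=True)
    for region in ranked:
        if region.cost_ms > remaining:
            break  # assumed behaviour: stop at the first region that is too expensive
        chosen.append(region)
        remaining -= region.cost_ms
    return chosen


# Made-up regions: all hold the same amount of garbage, only the rset cost differs.
REGIONS = [
    Region("cheap1", 1000, 0.3, 0.05),
    Region("cheap2", 1000, 0.3, 0.05),
    Region("cheap3", 1000, 0.3, 0.05),
    Region("mid", 1000, 15.0, 0.05),
    Region("huge", 1000, 668.0, 0.06),
]
```

With a 20 ms budget the model picks the three cheap regions and `mid`, but never `huge`, whose cost alone exceeds the goal (case 1). With a 15.5 ms budget `mid` would fit on its own, yet the cheap regions are ranked first and use up enough of the budget that it is left out (case 2).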
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
G1 paper mentions an rset scanning thread that I suspect is
intended to do rset scanning in the background so that regions
like these can be evicted more cheaply during the STW eviction
pause. I did not find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
_______________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
On 07/12/10 09:02, Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
>
> I have never run HBase, but in an LRU stress test (I posted about it a
> few months ago) I specifically observed remembered set scanning costs
> go way up. In addition I was seeing fallbacks to full GC:s recently in
> a slightly different test that I also posted about to -use, and that
> turned out to be a result of the estimated rset scanning costs being
> so high that regions were never selected for eviction even though they
> had very little live data. I would be very interested to hear if
> you're having the same problem. My last post on the topic is here:
>
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
>
> Including the link to the (throw-away) patch that should tell you
> whether this is what's happening:
>
> http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>
> Out of personal curiosity I'd be very interested to hear whether this
> is what's happening to you (in a real reasonable use-case rather than
> a synthetic benchmark).
>
> My sense (and hotspot/g1 developers please smack me around if I am
> misrepresenting anything here) is that the effect I saw (with rset
> scanning costs) could cause perpetual memory growth (until fallback to
> full GC) in two ways:
>
> (1) The estimated (and possibly real) cost of rset scanning for a
> single region could be so high that it is never possible to select it
> for eviction given the asked for pause time goals. Hence, such a
> region effectively "leaks" until full GC.
>
> (2) The estimated (and possibly real) cost of rset scanning for
> regions may be so high that there are, in practice, always other
> regions selected for high pay-off/cost ratios, such that they end up
> never being collected even if theoretically a single region could be
> evicted within the pause time goal.
>
> These are effectively the same thing, with (1) being an extreme case of (2).
>
> In both cases, the effect should be mitigated (and has been in the
> case where I did my testing), but as far as I can tell not generally
> "fixed", by increasing the pause time goals.
>
> It is unclear to me how this is intended to be handled. The original
> g1 paper mentions an rset scanning thread that I may suspect would be
> intended to help do rset scanning in the background such that regions
> like these could be evicted more cheaply during the STW eviction
> pause; but I didn't find such a thread anywhere in the source code -
> but I may very well just be missing it.
>
_______________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue.... thanks,
Tony, HS GC Group
_______________________________________________
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for URL references in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
(61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
(21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time
So in the most extreme case in the excerpt above, that's > half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive and almost only due to rset scanning costs.
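To quantify how much of each estimate is rset scanning, the lines above can be parsed with a small script. This is a sketch that assumes the patch output always has the shape shown in the excerpt (the format is known here only from those lines); the function and field names are made up for the example, and the regex tolerates the line wrapping introduced by e-mail.

```python
import re

# Matches one predict_region_elapsed_time_ms record, allowing line wraps.
PATTERN = re.compile(
    r"predict_region_elapsed_time_ms:\s+([\d.]+)ms\s+total,\s+"
    r"([\d.]+)ms\s+rs\s+scan\s+\((\d+)\s+cnum\),\s+"
    r"([\d.]+)\s+copy\s+time\s+\((\d+)\s+bytes\),\s+"
    r"([\d.]+)\s+other\s+time"
)


def parse_predictions(text):
    """Return one dict per record, with derived rset share and per-card cost."""
    rows = []
    for m in PATTERN.finditer(text):
        total, rs, cards, copy, nbytes, other = m.groups()
        row = {
            "total_ms": float(total),
            "rs_scan_ms": float(rs),
            "cards": int(cards),
            "copy_ms": float(copy),
            "copy_bytes": int(nbytes),
            "other_ms": float(other),
        }
        row["rs_share"] = row["rs_scan_ms"] / row["total_ms"]
        row["ms_per_card"] = row["rs_scan_ms"] / row["cards"]
        rows.append(row)
    return rows


# Two records copied from the excerpt above, including the e-mail line wraps.
SAMPLE = """\
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time
"""

for row in parse_predictions(SAMPLE):
    print(row["cards"], round(row["rs_share"], 4), row["ms_per_card"])
```

In both sample records the rset scan accounts for roughly 99% of the predicted total, and the predicted cost per card is about the same, which suggests the estimate scales linearly with the card count.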
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics accumulating from the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive;
predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there's plenty of further regions that
ideally should be collected because they contain almost no live data
(ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
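The "remaining" figures in the "picked region" output above are consistent with a simple running budget: each predicted cost is subtracted until the next region no longer fits. The following sketch replays the numbers from the log; the function name is made up and the logic is an assumption inferred from the printed values, not taken from the G1 source.

```python
def replay_picks(budget_ms, predicted_ms):
    """Subtract each predicted cost from the budget until one no longer fits."""
    picked = []
    remaining = budget_ms
    for cost in predicted_ms:
        if cost > remaining:
            return picked, remaining, cost  # first region that did not fit
        picked.append(cost)
        remaining -= cost
    return picked, remaining, None


# Figures copied from the log excerpt above (milliseconds).
picked, remaining, rejected = replay_picks(
    1.317244, [0.345890, 0.345963, 0.346036, 0.346036]
)
print(len(picked), round(remaining, 6), rejected)
```

Three regions fit, and the fourth is rejected with about 0.279355 ms left, matching the "next region too expensive" line in the log.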
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
|
# 12

30-07-2010 09:47 PM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
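The Full GC lines quoted above can be turned into rough numbers with a short script. This is a sketch that assumes the log line format shown in the quoted message; the `FullGC` tuple and function names are made up for the example. The "growth" figure is only the net rise in heap occupancy between two full collections, not the application's allocation rate.

```python
import re
from typing import NamedTuple


class FullGC(NamedTuple):
    uptime_s: float
    before_mb: int
    after_mb: int
    capacity_mb: int
    pause_s: float


FULL_GC = re.compile(
    r"(\d+\.\d+):\s+\[Full GC\s+(\d+)M->(\d+)M\((\d+)M\),\s+([\d.]+)\s+secs\]"
)


def parse_full_gcs(text):
    """Extract Full GC records, tolerating '>' quoting and wrapped lines."""
    unquoted = re.sub(r"^(?:>[ \t]?)+", "", text, flags=re.MULTILINE)
    return [
        FullGC(float(u), int(b), int(a), int(c), float(p))
        for u, b, a, c, p in FULL_GC.findall(unquoted)
    ]


def net_growth_mb_per_s(first, second):
    """Occupancy growth between the end of one full GC and the start of the next."""
    return (second.before_mb - first.after_mb) / (second.uptime_s - first.uptime_s)


# Two records copied from the quoted log, including the quoting and wraps.
SAMPLE = """\
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
"""

events = parse_full_gcs(SAMPLE)
print(events[0].after_mb / events[0].capacity_mb)
print(net_growth_mb_per_s(events[0], events[1]))
```

For these two records the heap is about 61% full after the first collection, in line with the 60% live set Todd mentions, and occupancy then climbs by roughly 10 MB per second until the next full collection.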
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here... The thing is the more we can get out of our
ram for this block cache, the better performing our systems are. Also
a lot of the settings are self tuning, so if we up the Xmx the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, if you have comparable CMS logs,
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
>
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full gc (which
as I understand you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
>
> On 07/06/10 13:27, Todd Lipcon wrote:
>> Hi all,
>>
>> I work on HBase, a distributed database written in Java. We
>> generally run on large heaps (8GB+), and our object lifetime
>> distribution has proven pretty problematic for garbage collection
>> (we manage a multi-GB LRU cache inside the process, so in CMS we
>> tenure a lot of byte arrays which later get collected).
>>
>> In Java6, we generally run with the CMS collector, tuning down
>> CMSInitiatingOccupancyFraction and constraining MaxNewSize to
>> achieve fairly low pause GC, but after a week or two of uptime we
>> often run into full heap compaction which takes several minutes
>> and wreaks havoc on the system.
>>
>> Needless to say, we've been watching the development of the G1 GC
>> with anticipation for the last year or two. Finally it seems in
>> the latest build of JDK7 it's stable enough for actual use
>> (previously it would segfault within an hour or so). However, in
>> my testing I'm seeing a fair amount of 8-10 second Full GC pauses.
>>
>> The flags I'm passing are:
>>
>> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
>> -XX:GCPauseIntervalMillis=80
>>
>> Most of the pauses I see in the GC log are around 10-20ms as expected:
>>
>> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
>> 0.01209000 secs]
>> [Parallel Time: 10.5 ms]
>> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
>> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
>> 1680080.1 1680080.0 1680079.9 1680081.5]
>> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2
>> 1.9 2.5 1.7 1.7 0.1
>> Avg: 1.8, Min: 0.1, Max: 2.5]
>> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>> Sum: 52, Avg: 4, Min: 1, Max: 8]
>> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4
>> 0.5 0.4 0.5 0.3 0.3 0.0
>> Avg: 0.4, Min: 0.0, Max: 0.7]
>> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6
>> 0.0 0.8 0.8 0.9
>> Avg: 0.5, Min: 0.0, Max: 0.9]
>> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2
>> 7.1 6.9 7.1 7.1 7.0
>> Avg: 7.1, Min: 6.9, Max: 7.3]
>> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Other: 0.6 ms]
>> [Clear CT: 0.5 ms]
>> [Other: 1.1 ms]
>> [Choose CSet: 0.0 ms]
>> [ 7677M->7636M(8000M)]
>> [Times: user=0.12 sys=0.00, real=0.01 secs]
>>
>> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC
>> 7934M->4865M(8000M), 9.8907800 secs]
>> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC
>> 7930M->4964M(8000M), 9.9025520 secs]
>> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC
>> 7934M->4882M(8000M), 10.1232190 secs]
>> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC
>> 7938M->5002M(8000M), 10.4997760 secs]
>> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC
>> 7938M->4962M(8000M), 11.0497380 secs]
>>
>> These pauses are pretty unacceptable for soft real time operation.
>>
>> Am I missing some tuning that should be done for G1GC for
>> applications like this? Is 20ms out of 80ms too aggressive a
>> target for the garbage rates we're generating?
>>
>> My actual live heap usage should be very stable around 5GB - the
>> application very carefully accounts its live object set at around
>> 60% of the max heap (as you can see in the logs above).
>>
>> At this point we are considering doing crazy things like ripping
>> out our main memory consumers into a custom slab allocator, and
>> manually reference count the byte array slices. But, if we can get
>> G1GC to work for us, it will save a lot of engineering on the
>> application side!
>>
>> Thanks
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>> ------------------------------------------------------------------------
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not include the
portion leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help.

Do you have large arrays in your application? The shape of the
promotion graph for CMS is somewhat jagged, indicating _perhaps_ that.
Yes, -XX:+PrintTenuringDistribution would shed a bit more light. As
regards fragmentation, it can be tricky to tune against, but we can try
once we understand a bit more about the object sizes and demographics.
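
As a rough check on the fragmentation reading, the occupancy figures in the log line quoted above can be pulled out and compared. This is only an illustrative sketch: the regular expression and the helper name are made up for this note, and it assumes the usual before->after(capacity) layout with sizes in K.

```python
import re

# Occupancy triples as they appear in the quoted log: before->after(capacity), in KB.
OCCUPANCY = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")

def free_before_kb(entry):
    """Free space (KB) in a generation just before the collection ran."""
    before, _after, capacity = map(int, OCCUPANCY.search(entry).groups())
    return capacity - before

# Fragments copied from the quoted promotion failure / concurrent mode failure.
young = "ParNew (promotion failed): 59008K->59008K(59008K)"
old = "(concurrent mode failure): 6359176K->4206871K(8323072K)"

old_free_kb = free_before_kb(old)
young_capacity_kb = int(OCCUPANCY.search(young).group(3))

print(f"old gen free before the failure: {old_free_kb / 1024 / 1024:.2f} GB")
print(f"young gen capacity: {young_capacity_kb / 1024:.1f} MB")
print(f"free old gen / young gen: {old_free_kb / young_capacity_kb:.0f}x")
```

Free space in the old gen is many times the size of the whole young gen, yet promotion still failed, which is consistent with the free space not being contiguous.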
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full GC
issues locally? That would give the quickest progress on this.
-- ramki
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for
slop, i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
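
The sizing rule above is simple arithmetic; a minimal sketch follows. The function name and the slop_factor parameter are made up for illustration, and the 64 MB young gen is taken from the NewSize/MaxNewSize flags quoted earlier in the thread.

```python
def heap_for_stable_old_gen(stable_old_gen_gb, young_gen_gb, slop_factor=2.0):
    """Whole-heap size: old gen at slop_factor x stable occupancy, plus young gen."""
    old_gen_gb = stable_old_gen_gb * slop_factor
    return old_gen_gb + young_gen_gb

# Figures from the discussion: 4.2 GB stable old gen, 64 MB young gen.
young_gb = 64 / 1024
heap_gb = heap_for_stable_old_gen(4.2, young_gb)
print(f"old gen {4.2 * 2:.1f} GB + young gen {young_gb:.4f} GB -> heap of about {heap_gb:.2f} GB")

# Todd's expected live set: 40% Memstore + 20% LRU cache of an 8 GB heap.
expected_live_gb = (0.40 + 0.20) * 8
print(f"expected live set: {expected_live_gb:.1f} GB")
```

Starting from a heap of that size and shrinking it step by step matches the "iterate downwards from a config that we know works" approach.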
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of your leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GCs recently in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
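
To make case (1) concrete, here is a toy model of cost-bounded region selection. It is not G1's actual policy: the greedy ordering by reclaimable space per millisecond, the function name, and all numbers are assumptions chosen only to show how a region with a large predicted rset cost can be skipped at every collection.

```python
def choose_collection_set(regions, pause_goal_ms):
    """Greedy pick by reclaimable-KB-per-ms until the pause budget runs out."""
    remaining = pause_goal_ms
    chosen = []
    for name, reclaimable_kb, predicted_ms in sorted(
            regions, key=lambda r: r[1] / r[2], reverse=True):
        if predicted_ms > remaining:
            break  # the next-best region does not fit, so selection stops
        chosen.append(name)
        remaining -= predicted_ms
    return chosen

# (name, reclaimable KB, predicted pause cost in ms); all values hypothetical.
regions = [
    ("cheap-a", 1000, 0.4),
    ("cheap-b", 1000, 0.4),
    ("big-rset", 1000, 45.0),  # almost all garbage, but a huge remembered set
]

small_goal = choose_collection_set(regions, pause_goal_ms=20)
large_goal = choose_collection_set(regions, pause_goal_ms=100)
print(small_goal)
print(large_goal)
```

With the 20 ms goal the "big-rset" region can never be chosen, however much garbage it holds; raising the goal lets it in, which matches the mitigation described in the message.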
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect is
intended to do rset scanning in the background, so that regions
like these could be evicted more cheaply during the STW eviction
pause. I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
On 07/12/10 09:02, Peter Schuller wrote:
> [...]
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue. Thanks,
Tony, HS GC Group
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for URL references in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this e-mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time
So in the most extreme case in the excerpt above, that's more than half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive, almost entirely due to rset scanning costs.
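
The share of rset scanning in these estimates, and the implied cost per card, can be read straight off the excerpt. The sketch below assumes each entry is on a single line (they are wrapped in the mail) and uses a regular expression written for this note; only the first and last entries are included.

```python
import re

# Matches the total, the rs scan time and the card count of one entry.
LINE = re.compile(
    r"predict_region_elapsed_time_ms: ([\d.]+)ms total, ([\d.]+)ms rs scan "
    r"\((\d+) cnum\)")

samples = [
    "predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan "
    "(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time",
    "predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan "
    "(914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time",
]

results = []
for line in samples:
    total_ms, rs_ms, cards = LINE.match(line).groups()
    total_ms, rs_ms, cards = float(total_ms), float(rs_ms), int(cards)
    share = rs_ms / total_ms              # fraction of the estimate spent on rs scan
    us_per_card = rs_ms * 1000 / cards    # microseconds per card
    results.append((cards, share, us_per_card))
    print(f"{cards:>7} cards: rs scan is {share:.1%} of the estimate, "
          f"{us_per_card:.3f} us per card")
```

Both entries work out to the same per-card figure, which suggests the prediction is simply linear in the number of cards.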
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics had been accumulated by the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive; predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there are plenty of further regions that
ideally should be collected because they contain almost nothing but garbage
(ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
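
The numbers in the patch output fit a plain budget-decrement loop. The sketch below replays them; it assumes the logged "remaining" value is the budget before the region's own cost is deducted, which is what the figures imply, and the function name is made up for illustration.

```python
def select_until_budget_exhausted(predicted_costs_ms, budget_ms):
    """Accept regions in order while each predicted cost still fits the budget."""
    picked = []
    remaining = budget_ms
    for cost in predicted_costs_ms:
        if cost > remaining:
            return picked, remaining, cost  # first region that no longer fits
        picked.append(cost)
        remaining -= cost
    return picked, remaining, None

# Predicted costs from the last three "picked region" lines, then the rejected one.
costs = [0.345890, 0.345963, 0.346036, 0.346036]
# "remaining" as logged on the first of those three lines.
budget = 1.317244

picked, remaining, rejected = select_until_budget_exhausted(costs, budget)
print(f"picked {len(picked)} regions, {remaining:.6f}ms left, "
      f"rejected a {rejected:.6f}ms region")
```

The leftover budget after the third pick comes out at the 0.279355ms shown in the final log line, so the selection stopped purely on the pause-time budget.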
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
# grab from appropriate tag, in case I change master
git checkout 20100714_1
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
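
The hypothesis in the list above can be illustrated with a toy model. This is not HotSpot's predictor: it assumes a simple moving average over the last ten samples, and the class name, the 45 ms and 0.3 ms costs, and the selection rule are all hypothetical.

```python
from collections import deque

class CostPredictor:
    """Predicts the next cost as the mean of the last few recorded samples."""
    def __init__(self, window=10):
        self.samples = deque(maxlen=window)

    def record(self, cost_ms):
        self.samples.append(cost_ms)

    def predict(self):
        return sum(self.samples) / len(self.samples)

PAUSE_GOAL_MS = 20.0
young, non_young = CostPredictor(), CostPredictor()

# A temporary swap storm inflates the measured "other" cost for both kinds.
for _ in range(10):
    young.record(45.0)
    non_young.record(45.0)

# Afterwards the real cost is back to normal (0.3 ms here).
for _ in range(50):
    young.record(0.3)                    # young collections always run
    if non_young.predict() <= PAUSE_GOAL_MS:
        non_young.record(0.3)            # only sampled if a region gets selected

print(f"young prediction: {young.predict():.1f} ms")
print(f"non-young prediction: {non_young.predict():.1f} ms")
```

In this model the young prediction recovers because new samples keep arriving, while the non-young prediction stays above the pause goal forever because nothing is ever selected to produce a fresh sample.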
--
/ Peter Schuller
|
# 13

30-07-2010 09:56 PM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
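
For readers who want to tabulate the Full GC lines quoted above, here is a minimal sketch. The regular expression is an assumption that fits only the line format shown in this message (with the mail wrapping undone), and the names `parse_full_gcs`, `LOG` and `FULL_GC` are made up for illustration.

```python
import re

# Full GC lines copied from the message above, with the mail line wrapping undone.
LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<total>\d+)M\),\s+(?P<secs>\d+\.\d+) secs\]"
)


def parse_full_gcs(text):
    """Return one dict per Full GC line: uptime (s), heap before/after/total (MB), pause (s)."""
    return [
        {
            "uptime": float(m["uptime"]),
            "before": int(m["before"]),
            "after": int(m["after"]),
            "total": int(m["total"]),
            "secs": float(m["secs"]),
        }
        for m in FULL_GC.finditer(text)
    ]


gcs = parse_full_gcs(LOG)
live_fractions = [entry["after"] / entry["total"] for entry in gcs]
mean_pause = sum(entry["secs"] for entry in gcs) / len(gcs)

for entry, frac in zip(gcs, live_fractions):
    print(f"{entry['uptime']:>10.3f}s  {entry['before']}M -> {entry['after']}M  "
          f"live after GC: {frac:.1%}  pause: {entry['secs']:.2f}s")
print(f"mean Full GC pause: {mean_pause:.2f}s")
```

The heap occupancy after each Full GC comes out at roughly 61-63% of the 8000M heap, which is consistent with the "around 60%" live set mentioned in the message.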
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs almost look like the heap got too full,
so it must mean that incremental collection is not keeping up with
the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
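
As a rough check on the "not keeping up" reading, the Full GC lines in Todd's message allow a back-of-the-envelope estimate of how fast the heap refills between full collections. This is a sketch, assuming the uptime stamp marks the start of each pause; the function name `refill_rates` is made up for illustration.

```python
# (uptime at start of Full GC in s, heap before in MB, heap after in MB, pause in s),
# taken from the Full GC lines in Todd's message.
FULL_GCS = [
    (52521.446, 7934, 4865, 9.8907800),
    (52820.032, 7930, 4964, 9.9025520),
    (53227.102, 7934, 4882, 10.1232190),
    (53661.279, 7938, 5002, 10.4997760),
    (54211.761, 7938, 4962, 11.0497380),
]


def refill_rates(full_gcs):
    """MB/s of occupancy growth from the end of one Full GC to the start of the next."""
    rates = []
    for (t0, _, after, pause), (t1, before, _, _) in zip(full_gcs, full_gcs[1:]):
        elapsed = t1 - (t0 + pause)  # seconds the application ran in between
        rates.append((before - after) / elapsed)
    return rates


rates = refill_rates(FULL_GCS)
print([round(r, 1) for r in rates])  # [10.6, 7.5, 7.2, 5.4]
```

Since the live set is reported as stable, this growth of several MB/s would be garbage that the incremental pauses did not reclaim before the heap filled up.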
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane. We'd like to have some reasonable amount of
memory efficiency here. The thing is, the more we can get out of our
RAM for this block cache, the better our systems perform. Also,
a lot of the settings are self-tuning, so if we up the Xmx, the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, if you have comparable CMS logs,
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) Do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) Does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand, you get to more quickly with G1 than you did
with CMS).
-- ramki
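
For the CMS settings quoted below, a quick calculation shows how much room a given CMSInitiatingOccupancyFraction leaves. This is only a sketch: combining the 8000M heap of the G1 run with the 64m young gen and the "about 60%" live set is an assumption made for illustration, and the helper name `cms_trigger_mb` is made up. Note also that, unless -XX:+UseCMSInitiatingOccupancyOnly is set, the JVM treats the fraction as a starting hint rather than a fixed threshold.

```python
def cms_trigger_mb(heap_mb, new_mb, initiating_occupancy_pct):
    """Old-gen occupancy (MB) at which a CMS cycle would start, for a fixed-size young gen."""
    old_mb = heap_mb - new_mb
    return old_mb * initiating_occupancy_pct / 100


HEAP_MB = 8000                # -Xmx8000m, as in the G1 run (assumed here for CMS too)
NEW_MB = 64                   # -XX:NewSize=64m -XX:MaxNewSize=64m
LIVE_MB = HEAP_MB * 60 / 100  # "about 60% heap usage"

for pct in (75, 88):
    trigger = cms_trigger_mb(HEAP_MB, NEW_MB, pct)
    above_live = round(trigger - LIVE_MB, 2)            # garbage that can pile up before a cycle
    headroom = round(HEAP_MB - NEW_MB - trigger, 2)     # space left while the cycle runs
    print(pct, trigger, above_live, headroom)
# 75 5952.0 1152.0 1984.0
# 88 6983.68 2183.68 952.32
```

Under these assumptions a fraction of 75 leaves about 2 GB for promotions while the concurrent cycle runs, whereas 88 leaves under 1 GB.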
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not include the
portion leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full GC
issues locally for quickest progress on this?
-- ramki
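
The fragmentation reading can be checked against the numbers in the quoted log line. A minimal sketch follows; `LINE` is a trimmed excerpt of that line (the interleaved CMS-concurrent-preclean output is left out), and the helper name `transition` is made up for illustration.

```python
import re

# Trimmed excerpt of the log line quoted above (concurrent-preclean output omitted).
LINE = ("[ParNew (promotion failed): 59008K->59008K(59008K), 0.0179250 secs] "
        "(concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220 secs]")


def transition(label, text):
    """Return (before_kb, after_kb, capacity_kb) for the generation tagged with `label`."""
    m = re.search(re.escape(label) + r": (\d+)K->(\d+)K\((\d+)K\)", text)
    return tuple(int(g) for g in m.groups())


young_before, young_after, young_cap = transition("(promotion failed)", LINE)
old_before, old_after, old_cap = transition("(concurrent mode failure)", LINE)

old_free_kb = old_cap - old_before
print(round(old_free_kb / 1024 / 1024, 2))  # 1.87 (GiB free in the old gen)
print(round(old_free_kb / young_cap, 1))    # 33.3 (free space as a multiple of the young gen)
```

A promotion failure with free space worth roughly 33 young generations is hard to explain by total occupancy alone, which is why a lack of large enough contiguous free chunks is the natural suspect.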
_______________________________________________
___________________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for
slop, i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy turns out to be), then add to that the young gen size,
and use that for the whole heap. I would be even more aggressive
and grant more to the old gen; as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of your leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
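
The sizing arithmetic in this exchange can be written out as follows. This is a sketch: the function names are made up, the factor of 2 is the "allow as much again for slop" suggestion above, and the 64 MB young gen is taken from the CMS flags quoted earlier in the thread.

```python
def cache_footprint_gb(heap_gb, memstore_pct, block_cache_pct):
    """Long-lived data the application is configured to hold, in GB."""
    return heap_gb * (memstore_pct + block_cache_pct) / 100


def suggested_heap_gb(stable_old_gb, young_gb, slop_factor=2.0):
    """Old gen sized at slop_factor times the stable occupancy, plus the young gen."""
    return stable_old_gb * slop_factor + young_gb


print(cache_footprint_gb(8, 40, 20))                 # 4.8
print(cache_footprint_gb(8, 20, 20))                 # 3.2
print(round(suggested_heap_gb(4.2, 64 / 1024), 4))   # 8.4625
```

The first two values match the 4.8 GB and 3.2 GB figures Todd gives; the third is the whole-heap size implied by an 8.4 GB old gen plus a 64 MB young gen.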
_______________________________________________
___________________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
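
To make the "ref-counted slices of a single pre-allocated byte array" idea concrete, here is a minimal sketch. It is written in Python purely for illustration and is not HBase code; the class `Slab` and its methods are hypothetical, and a real implementation in the Java application would presumably sit on top of a byte[] or ByteBuffer and need thread safety.

```python
class Slab:
    """Fixed-size slots carved out of one pre-allocated buffer, with reference counts."""

    def __init__(self, slot_size, slot_count):
        self.slot_size = slot_size
        self.buffer = bytearray(slot_size * slot_count)  # allocated once, never resized
        self.free_slots = list(range(slot_count))
        self.refcounts = [0] * slot_count

    def allocate(self, data):
        """Copy `data` into a free slot and return the slot index (refcount 1)."""
        if len(data) > self.slot_size:
            raise ValueError("payload larger than a slot")
        if not self.free_slots:
            raise MemoryError("slab exhausted")
        slot = self.free_slots.pop()
        start = slot * self.slot_size
        self.buffer[start:start + len(data)] = data  # same-length slice write, no resize
        self.refcounts[slot] = 1
        return slot

    def retain(self, slot):
        """Register one more holder of the slot."""
        self.refcounts[slot] += 1

    def release(self, slot):
        """Drop one holder; the slot is recycled when the last holder lets go."""
        self.refcounts[slot] -= 1
        if self.refcounts[slot] == 0:
            self.free_slots.append(slot)

    def read(self, slot, length):
        """Return a copy of the first `length` bytes stored in the slot."""
        start = slot * self.slot_size
        return bytes(self.buffer[start:start + length])


slab = Slab(slot_size=64 * 1024, slot_count=4)
a = slab.allocate(b"block-a")
slab.retain(a)    # a second reader holds the block
slab.release(a)   # first reader done; the slot is still referenced
slab.release(a)   # last reader done; the slot returns to the free list
b = slab.allocate(b"block-b")
print(a == b)               # True: the freed slot is reused
print(slab.read(b, 7))      # b'block-b'
```

Because every slot has the same size and the backing buffer is allocated once, the collector would see one large long-lived array instead of many tenured byte arrays of varying sizes, which is the fragmentation argument made in the message.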
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GCs recently in
a slightly different test that I also posted about to hotspot-gc-use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect is
intended to help do rset scanning in the background, such that regions
like these could be evicted more cheaply during the STW eviction
pause. I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
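
Cases (1) and (2) above can be illustrated with a toy selection loop. This is a sketch under loud assumptions: it is not the G1 collection set algorithm, the greedy payoff-per-cost ranking is a stand-in for it, and the region names, garbage amounts and predicted costs are invented.

```python
def choose_collection_set(regions, pause_budget_ms):
    """Greedy pick by reclaimable MB per predicted millisecond, within a pause budget."""
    ranked = sorted(regions, key=lambda r: r["garbage_mb"] / r["cost_ms"], reverse=True)
    chosen, spent = [], 0.0
    for region in ranked:
        if spent + region["cost_ms"] <= pause_budget_ms:
            chosen.append(region["name"])
            spent += region["cost_ms"]
    return chosen


regions = [
    {"name": "A", "garbage_mb": 0.9, "cost_ms": 30.0},  # case (1): cost alone exceeds 20 ms
    {"name": "B", "garbage_mb": 0.8, "cost_ms": 12.0},  # case (2): fits alone, poor ratio
    {"name": "C", "garbage_mb": 0.5, "cost_ms": 4.0},
    {"name": "D", "garbage_mb": 0.5, "cost_ms": 4.0},
    {"name": "E", "garbage_mb": 0.4, "cost_ms": 4.0},
    {"name": "F", "garbage_mb": 0.4, "cost_ms": 4.0},
]

print(choose_collection_set(regions, 20.0))  # ['C', 'D', 'E', 'F']
print(choose_collection_set(regions, 60.0))  # ['C', 'D', 'E', 'F', 'B', 'A']
```

With a 20 ms budget, region A can never be picked and region B is crowded out as long as cheaper candidates keep appearing, even though both are mostly garbage. Raising the budget lets them in, which mirrors the mitigation described above.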
_______________________________________________
___________________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
On 07/12/10 09:02, Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
>
> I have never run HBase, but in an LRU stress test (I posted about it a
> few months ago) I specifically observed remembered set scanning costs
> go way up. In addition I was seeing fallbacks to full GC:s recently in
> a slightly different test that I also posed about to -use, and that
> turned out to be a result of the estimated rset scanning costs being
> so high that regions were never selected for eviction even though they
> had very little live data. I would be very interested to hear if
> you're having the same problem. My last post on the topic is here:
>
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
>
> Including the link to the (throw-away) patch that should tell you
> whether this is what's happening:
>
> http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>
> Out of personal curiosity I'd be very interested to hear whether this
> is what's happening to you (in a real reasonable use-case rather than
> a synthetic benchmark).
>
> My sense (and hotspot/g1 developers please smack me around if I am
> misrepresenting anything here) is that the effect I saw (with rset
> scanning costs) could cause perpetual memory grow (until fallback to
> full GC) in two ways:
>
> (1) The estimated (and possibly real) cost of rset scanning for a
> single region could be so high that it is never possible to select it
> for eviction given the asked for pause time goals. Hence, such a
> region effectively "leaks" until full GC.
>
> (2) The estimated (and possibly real) cost of rset scanning for
> regions may be so high that there are, in practice, always other
> regions selected for high pay-off/cost ratios, such that they end up
> never being collected even if theoretically a single region could be
> evicted within the pause time goal.
>
> These are effectively the same thing, with (1) being an extreme case of (2).
>
> In both cases, the effect should be mitigated (and has been in the
> case where I did my testing), but as far as I can tell not generally
> "fixed", by increasing the pause time goals.
>
> It is unclear to me how this is intended to be handled. The original
> g1 paper mentions an rset scanning thread that I suspect would be
> intended to help do rset scanning in the background such that regions
> like these could be evicted more cheaply during the STW eviction
> pause. I didn't find such a thread anywhere in the source code,
> but I may very well just be missing it.
>
_______________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue.... thanks,
Tony, HS GC Group
Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
>>
>
> I have never run HBase, but in an LRU stress test (I posted about it a
> few months ago) I specifically observed remembered set scanning costs
> go way up. In addition I was seeing fallbacks to full GC:s recently in
> a slightly different test that I also posted about to -use, and that
> turned out to be a result of the estimated rset scanning costs being
> so high that regions were never selected for eviction even though they
> had very little live data. I would be very interested to hear if
> you're having the same problem. My last post on the topic is here:
>
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
>
> Including the link to the (throw-away) patch that should tell you
> whether this is what's happening:
>
> http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>
> Out of personal curiosity I'd be very interested to hear whether this
> is what's happening to you (in a real reasonable use-case rather than
> a synthetic benchmark).
>
> My sense (and hotspot/g1 developers please smack me around if I am
> misrepresenting anything here) is that the effect I saw (with rset
> scanning costs) could cause perpetual memory growth (until fallback to
> full GC) in two ways:
>
> (1) The estimated (and possibly real) cost of rset scanning for a
> single region could be so high that it is never possible to select it
> for eviction given the asked for pause time goals. Hence, such a
> region effectively "leaks" until full GC.
>
> (2) The estimated (and possibly real) cost of rset scanning for
> regions may be so high that there are, in practice, always other
> regions selected for high pay-off/cost ratios, such that they end up
> never being collected even if theoretically a single region could be
> evicted within the pause time goal.
>
> These are effectively the same thing, with (1) being an extreme case of (2).
>
> In both cases, the effect should be mitigated (and has been in the
> case where I did my testing), but as far as I can tell not generally
> "fixed", by increasing the pause time goals.
>
> It is unclear to me how this is intended to be handled. The original
> g1 paper mentions an rset scanning thread that I suspect would be
> intended to help do rset scanning in the background such that regions
> like these could be evicted more cheaply during the STW eviction
> pause. I didn't find such a thread anywhere in the source code,
> but I may very well just be missing it.
>
>
_______________________________________________
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for URL references in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
(61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
(21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time
So in the most extreme case in the excerpt above, that's more than half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive and almost only due to rset scanning costs.
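As a rough cross-check of the figures above, here is a minimal Python sketch (not part of the patch; the names parse and summarize are made up for illustration). It takes two of the lines from the excerpt, re-joined where the mail wrapped them, and works out the predicted cost per card and the share of the prediction that is rs scan. It assumes the log format stays exactly as printed above.

```python
import re

# Two of the lines from the excerpt above, re-joined where the mail wrapped them.
LOG = (
    "predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan "
    "(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time\n"
    "predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan "
    "(914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time\n"
)

LINE = re.compile(
    r"predict_region_elapsed_time_ms: ([\d.]+)ms total, ([\d.]+)ms rs scan "
    r"\((\d+) cnum\), ([\d.]+) copy time \((\d+) bytes\), ([\d.]+) other time"
)


def parse(log):
    """Return one dict per predict_region_elapsed_time_ms line."""
    rows = []
    for m in LINE.finditer(log):
        total, rs, cards, copy, nbytes, other = m.groups()
        rows.append({
            "total_ms": float(total),
            "rs_ms": float(rs),
            "cards": int(cards),
            "copy_ms": float(copy),
            "bytes": int(nbytes),
            "other_ms": float(other),
        })
    return rows


def summarize(row):
    """Per-card scan cost in microseconds and rs scan share of the total."""
    return {
        "us_per_card": row["rs_ms"] / row["cards"] * 1000.0,
        "rs_share": row["rs_ms"] / row["total_ms"],
    }


if __name__ == "__main__":
    for row in parse(LOG):
        s = summarize(row)
        print("%d cards: %.3f us/card, rs scan = %.1f%% of total"
              % (row["cards"], s["us_per_card"], 100 * s["rs_share"]))
```

On these two lines the per-card figure comes out essentially identical, which suggests the prediction is a fixed per-card cost multiplied by the card count, so the card count alone is what makes a region look expensive.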
If you scroll down a bit to the first (and ONLY) partial collection that
happened after the statistics were accumulated from the marking phase,
we see more output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive;
predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there's plenty of further regions that
ideally should be collected because they contain almost nothing but
garbage (ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
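To make the stopping behaviour concrete, here is a toy Python sketch of budget-limited region selection. It is a simplification written for illustration, not HotSpot's actual collection set code: the function name choose_cset and the region labels are invented, and the candidates are assumed to be already sorted by GC efficiency. The costs and the starting budget are taken from the patch output above.

```python
def choose_cset(candidates, budget_ms):
    """Pick regions in order until the next one no longer fits the budget.

    candidates: list of (name, predicted_ms), assumed pre-sorted by efficiency.
    Returns (picked_names, remaining_ms).
    """
    picked = []
    remaining = budget_ms
    for name, predicted_ms in candidates:
        if predicted_ms > remaining:
            # Mirrors "next region too expensive" in the log: selection stops
            # here and everything after this point is left uncollected.
            break
        picked.append(name)
        remaining -= predicted_ms
    return picked, remaining


# Predicted costs from the log above; region names are hypothetical labels.
CANDIDATES = [
    ("r1", 0.345890),
    ("r2", 0.345963),
    ("r3", 0.346036),
    ("r4", 0.346036),      # the cheap region that no longer fit
    ("r5", 668.595597),    # an rset-heavy region from the earlier excerpt
]

if __name__ == "__main__":
    picked, remaining = choose_cset(CANDIDATES, budget_ms=1.317244)
    print(picked, "remaining %.6f ms" % remaining)
```

With the budget from the log, three regions are picked and the fourth is rejected with 0.279355 ms left, matching the output above. A region predicted at hundreds of milliseconds can never fit a budget of this size, however little live data it holds.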
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
reproduce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to be > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
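To illustrate the feedback loop in the hypothesis above, here is a toy Python model. It is purely an assumption-laden sketch: the predictor is a plain moving average over the last few samples, the class and function names are invented, and none of it is taken from the HotSpot sources. The only point is to show how a cost history that is updated only when collection happens can get stuck after a temporary spike.

```python
from collections import deque


class CostHistory:
    """Moving-average predictor over the last `window` recorded costs."""

    def __init__(self, window=3):
        self.samples = deque(maxlen=window)

    def add(self, cost_ms):
        self.samples.append(cost_ms)

    def predict(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)


def simulate(actual_costs, budget_ms, always_collect):
    """Return the prediction made before each GC.

    A new sample is recorded only if the collection actually happens:
    always (young-like case) or only when the prediction fits the budget
    (non-young-like case).
    """
    history = CostHistory()
    predictions = []
    for actual in actual_costs:
        predicted = history.predict()
        predictions.append(predicted)
        if always_collect or predicted <= budget_ms:
            history.add(actual)
    return predictions


# Hypothetical costs in ms: cheap, then a short "swap storm", then cheap again.
COSTS = [1, 1, 45, 45, 45, 1, 1, 1, 1, 1, 1, 1]

if __name__ == "__main__":
    print("non-young:", simulate(COSTS, 20.0, always_collect=False)[-1])
    print("young:    ", simulate(COSTS, 20.0, always_collect=True)[-1])
```

In this model the young-like history recovers once the storm passes, while the non-young-like prediction stays above the budget indefinitely because no new samples are ever recorded.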
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that they're exactly or almost exactly the same,
as would be expected under my proposed hypothesis that the
non-young "other" time is stuck? If the young other time is not stuck
I guess one might see some variation (I seem to get < 1 ms on my
machine) but not a lot at all in comparison to 40ms. If you're seeing
variation like 40-42 all the time, and it never decreases
significantly after reaching the 40ms range, that would be
consistent with the hypothesis I believe.
--
/ Peter Schuller
_______________________________________________
|
# 14

30-07-2010 10:39 PM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
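For a quick look at the full collections themselves, here is a small Python sketch that pulls the numbers out of the five "Full GC" lines in Todd's mail (quoted below), re-joined where they wrapped. The helper names are invented for illustration and the regular expression assumes exactly the log format shown in the thread.

```python
import re

# The five "Full GC" lines from Todd's mail, re-joined where they wrapped.
LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

FULL_GC = re.compile(
    r"(\d+\.\d+): \[Full GC (\d+)M->(\d+)M\((\d+)M\), ([\d.]+) secs\]"
)


def parse_full_gcs(log):
    """Return one dict per Full GC line: uptime, occupancy and pause."""
    events = []
    for m in FULL_GC.finditer(log):
        uptime, before, after, capacity, pause = m.groups()
        events.append({
            "uptime_s": float(uptime),
            "before_mb": int(before),
            "after_mb": int(after),
            "capacity_mb": int(capacity),
            "pause_s": float(pause),
        })
    return events


def summarize(events):
    """Gaps between consecutive full GCs and mean post-GC occupancy."""
    gaps = [b["uptime_s"] - a["uptime_s"] for a, b in zip(events, events[1:])]
    mean_after = sum(e["after_mb"] for e in events) / len(events)
    return {
        "gaps_s": gaps,
        "mean_after_mb": mean_after,
        "mean_live_fraction": mean_after / events[0]["capacity_mb"],
    }


if __name__ == "__main__":
    summary = summarize(parse_full_gcs(LOG))
    print("gaps between full GCs (s):", [round(g, 1) for g in summary["gaps_s"]])
    print("mean occupancy after full GC: %.0fM (%.0f%% of heap)"
          % (summary["mean_after_mb"], 100 * summary["mean_live_fraction"]))
```

The post-collection occupancy averages a little over 60% of the 8000M heap, in line with the live set Todd describes, so each full GC is reclaiming roughly 3 GB of garbage that the incremental pauses left behind.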
-- ramki
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run into
> full heap compaction which takes several minutes and wreaks havoc on the
> system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a fair
> amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000
> secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5
> 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60% of
> the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out our
> main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
I also work with Todd on this system (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here... The thing is the more we can get out of our
ram for this block cache, the better performing our systems are. Also
a lot of the settings are self tuning, so if we up the Xmx the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, if you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
>
> On 07/06/10 13:27, Todd Lipcon wrote:
>> Hi all,
>>
>> I work on HBase, a distributed database written in Java. We generally
>> run on large heaps (8GB+), and our object lifetime distribution has
>> proven pretty problematic for garbage collection (we manage a multi-GB
>> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
>> which later get collected).
>>
>> In Java6, we generally run with the CMS collector, tuning down
>> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
>> fairly low pause GC, but after a week or two of uptime we often run into
>> full heap compaction which takes several minutes and wreaks havoc on the
>> system.
>>
>> Needless to say, we've been watching the development of the G1 GC with
>> anticipation for the last year or two. Finally it seems in the latest
>> build of JDK7 it's stable enough for actual use (previously it would
>> segfault within an hour or so). However, in my testing I'm seeing a fair
>> amount of 8-10 second Full GC pauses.
>>
>> The flags I'm passing are:
>>
>> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
>>
>> Most of the pauses I see in the GC log are around 10-20ms as expected:
>>
>> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000
>> secs]
>> [Parallel Time: 10.5 ms]
>> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
>> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
>> 1680080.1 1680080.0 1680079.9 1680081.5]
>> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5
>> 1.7 1.7 0.1
>> Avg: 1.8, Min: 0.1, Max: 2.5]
>> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>> Sum: 52, Avg: 4, Min: 1, Max: 8]
>> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
>> 0.4 0.5 0.3 0.3 0.0
>> Avg: 0.4, Min: 0.0, Max: 0.7]
>> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
>> 0.8 0.8 0.9
>> Avg: 0.5, Min: 0.0, Max: 0.9]
>> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
>> 6.9 7.1 7.1 7.0
>> Avg: 7.1, Min: 6.9, Max: 7.3]
>> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Other: 0.6 ms]
>> [Clear CT: 0.5 ms]
>> [Other: 1.1 ms]
>> [Choose CSet: 0.0 ms]
>> [ 7677M->7636M(8000M)]
>> [Times: user=0.12 sys=0.00, real=0.01 secs]
>>
>> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
>> 9.8907800 secs]
>> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
>> 9.9025520 secs]
>> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
>> 10.1232190 secs]
>> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
>> 10.4997760 secs]
>> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
>> 11.0497380 secs]
>>
>> These pauses are pretty unacceptable for soft real time operation.
>>
>> Am I missing some tuning that should be done for G1GC for applications
>> like this? Is 20ms out of 80ms too aggressive a target for the garbage
>> rates we're generating?
>>
>> My actual live heap usage should be very stable around 5GB - the
>> application very carefully accounts its live object set at around 60% of
>> the max heap (as you can see in the logs above).
>>
>> At this point we are considering doing crazy things like ripping out our
>> main memory consumers into a custom slab allocator, and manually
>> reference count the byte array slices. But, if we can get G1GC to work
>> for us, it will save a lot of engineering on the application side!
>>
>> Thanks
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>
>> ------------------------------------------------------------------------
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full gc (which
as I understand you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
>
> On 07/06/10 13:27, Todd Lipcon wrote:
>> Hi all,
>>
>> I work on HBase, a distributed database written in Java. We
>> generally run on large heaps (8GB+), and our object lifetime
>> distribution has proven pretty problematic for garbage collection
>> (we manage a multi-GB LRU cache inside the process, so in CMS we
>> tenure a lot of byte arrays which later get collected).
>>
>> In Java6, we generally run with the CMS collector, tuning down
>> CMSInitiatingOccupancyFraction and constraining MaxNewSize to
>> achieve fairly low pause GC, but after a week or two of uptime we
>> often run into full heap compaction which takes several minutes
>> and wreaks havoc on the system.
>>
>> Needless to say, we've been watching the development of the G1 GC
>> with anticipation for the last year or two. Finally it seems in
>> the latest build of JDK7 it's stable enough for actual use
>> (previously it would segfault within an hour or so). However, in
>> my testing I'm seeing a fair amount of 8-10 second Full GC pauses.
>>
>> The flags I'm passing are:
>>
>> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
>> -XX:GCPauseIntervalMillis=80
>>
>> Most of the pauses I see in the GC log are around 10-20ms as expected:
>>
>> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
>> 0.01209000 secs]
>> [Parallel Time: 10.5 ms]
>> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
>> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
>> 1680080.1 1680080.0 1680079.9 1680081.5]
>> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2
>> 1.9 2.5 1.7 1.7 0.1
>> Avg: 1.8, Min: 0.1, Max: 2.5]
>> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>> Sum: 52, Avg: 4, Min: 1, Max: 8]
>> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4
>> 0.5 0.4 0.5 0.3 0.3 0.0
>> Avg: 0.4, Min: 0.0, Max: 0.7]
>> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6
>> 0.0 0.8 0.8 0.9
>> Avg: 0.5, Min: 0.0, Max: 0.9]
>> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2
>> 7.1 6.9 7.1 7.1 7.0
>> Avg: 7.1, Min: 6.9, Max: 7.3]
>> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Other: 0.6 ms]
>> [Clear CT: 0.5 ms]
>> [Other: 1.1 ms]
>> [Choose CSet: 0.0 ms]
>> [ 7677M->7636M(8000M)]
>> [Times: user=0.12 sys=0.00, real=0.01 secs]
>>
>> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC
>> 7934M->4865M(8000M), 9.8907800 secs]
>> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC
>> 7930M->4964M(8000M), 9.9025520 secs]
>> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC
>> 7934M->4882M(8000M), 10.1232190 secs]
>> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC
>> 7938M->5002M(8000M), 10.4997760 secs]
>> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC
>> 7938M->4962M(8000M), 11.0497380 secs]
>>
>> These pauses are pretty unacceptable for soft real time operation.
>>
>> Am I missing some tuning that should be done for G1GC for
>> applications like this? Is 20ms out of 80ms too aggressive a
>> target for the garbage rates we're generating?
>>
>> My actual live heap usage should be very stable around 5GB - the
>> application very carefully accounts its live object set at around
>> 60% of the max heap (as you can see in the logs above).
>>
>> At this point we are considering doing crazy things like ripping
>> out our main memory consumers into a custom slab allocator, and
>> manually reference count the byte array slices. But, if we can get
>> G1GC to work for us, it will save a lot of engineering on the
>> application side!
>>
>> Thanks
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>> ------------------------------------------------------------------------
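The Full GC entries quoted above can be summarized mechanically. The following is a minimal sketch, assuming the log lines have had the mail client's wraps rejoined; the regular expression and the names `parse_full_gcs`, `events`, and so on are illustrative, not part of any JVM tooling.

```python
import re

# The five Full GC entries quoted above, with the mail client's line wraps rejoined.
LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

# Hypothetical pattern written for this excerpt only; other log formats will differ.
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<capacity>\d+)M\), "
    r"(?P<pause>\d+\.\d+) secs\]"
)


def parse_full_gcs(text):
    """Return one dict per Full GC entry found in the log text."""
    return [
        {
            "uptime_s": float(m.group("uptime")),
            "before_mb": int(m.group("before")),
            "after_mb": int(m.group("after")),
            "capacity_mb": int(m.group("capacity")),
            "pause_s": float(m.group("pause")),
        }
        for m in FULL_GC.finditer(text)
    ]


events = parse_full_gcs(LOG)
mean_pause_s = sum(e["pause_s"] for e in events) / len(events)
mean_live_mb = sum(e["after_mb"] for e in events) / len(events)
live_fraction = mean_live_mb / events[0]["capacity_mb"]
# Time between consecutive full GCs, from the JVM uptime stamps.
gaps_s = [b["uptime_s"] - a["uptime_s"] for a, b in zip(events, events[1:])]

print(f"full GCs: {len(events)}, mean pause: {mean_pause_s:.2f} s")
print(f"mean post-GC occupancy: {mean_live_mb:.0f} MB ({live_fraction:.0%} of heap)")
print("seconds between full GCs:", [round(g) for g in gaps_s])
```

On this excerpt the post-collection occupancy sits a little above 60% of the 8000M heap, and the gaps between full GCs fall roughly in the 5 to 9 minute range, which agrees with the description in the message.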
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full gc
issues locally for quickest progress on this?
-- ramki
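The fragmentation inference can be checked with arithmetic on the quoted "concurrent mode failure" line. This is a small sketch, assuming the log fragment below (copied from the message, wraps rejoined, the interleaved CMS-concurrent-preclean text left out); the regular expression and variable names are illustrative only.

```python
import re

# Fragment of the quoted log line: old-gen figures first, whole-heap figures second.
LINE = (
    "(concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220 secs] "
    "6417373K->4206871K(8382080K)"
)

m = re.search(
    r"\(concurrent mode failure\): (\d+)K->(\d+)K\((\d+)K\).*?\] "
    r"(\d+)K->(\d+)K\((\d+)K\)",
    LINE,
)
old_before_kb, old_after_kb, old_capacity_kb = (int(m.group(i)) for i in (1, 2, 3))
heap_before_kb, heap_after_kb, heap_capacity_kb = (int(m.group(i)) for i in (4, 5, 6))

# Free space in the old generation at the moment promotion failed.
old_free_kb = old_capacity_kb - old_before_kb
# Young generation capacity is whatever the heap has beyond the old generation.
young_capacity_kb = heap_capacity_kb - old_capacity_kb

print(f"old gen free:       {old_free_kb} K ({old_free_kb / 1024 / 1024:.2f} GB)")
print(f"young gen capacity: {young_capacity_kb} K")
print(f"free old space is {old_free_kb / young_capacity_kb:.0f}x the young gen")
```

The old generation had well over an order of magnitude more free space than the entire young generation could have promoted, so a promotion failure at that point is hard to explain by total free space alone, which is why fragmentation is the natural suspect.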
_______________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much for slop,
i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of yr leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
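The sizing arithmetic in this exchange is easy to restate. The sketch below assumes the figures given in the messages (an 8G heap, 40% Memstore plus 20% LRU cache, a 4.2 GB stable old gen, a 64m young gen); the function names and the `slop_factor` parameter are hypothetical helpers, not JVM options.

```python
def expected_live_gb(heap_gb, memstore_fraction, cache_fraction):
    """Long-lived data the application expects to hold, in GB."""
    return heap_gb * (memstore_fraction + cache_fraction)


def suggested_heap_gb(stable_old_gb, young_gb, slop_factor=2.0):
    """Old gen sized at slop_factor times its stable occupancy, plus the young gen."""
    return stable_old_gb * slop_factor + young_gb


# 40% Memstore + 20% LRU cache on an 8G heap, as described by Todd.
print(round(expected_live_gb(8, 0.40, 0.20), 2))   # the 4.8GB figure
# The 20/20 configuration Todd offers to try instead.
print(round(expected_live_gb(8, 0.20, 0.20), 2))   # the 3.2GB figure
# Ramki's suggestion: old gen = 2 x 4.2 GB, plus a 64m (0.0625 GB) young gen.
print(suggested_heap_gb(4.2, 0.0625))
```

Starting from a configuration with this much headroom and iterating downwards, as suggested in the reply, amounts to lowering `slop_factor` step by step until the long pauses reappear.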
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GC:s recently in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect would be
intended to help do rset scanning in the background such that regions
like these could be evicted more cheaply during the STW eviction
pause; but I didn't find such a thread anywhere in the source code -
but I may very well just be missing it.
--
/ Peter Schuller
_______________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
On 07/12/10 09:02, Peter Schuller wrote:
_______________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue.... thanks,
Tony, HS GC Group
Peter Schuller wrote:
_______________________________________________
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for the URLs referenced in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
(61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
(21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time
So in the most extreme case in the excerpt above, that's > half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive and almost only due to rset scanning costs.
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics accumulating from the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive;
predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there's plenty of further regions that
ideally should be collected because they contain almost nothing but
garbage (ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
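The selection behaviour visible in the "picked region" output can be modelled as a greedy loop over candidates with a shrinking time budget. This is a simplified sketch, not HotSpot's actual code: `choose_collection_set` and the region names are hypothetical, and the numbers are the predicted and remaining times from the excerpt above.

```python
def choose_collection_set(candidates, budget_ms):
    """Greedily pick regions, in the given order, until the next one would
    exceed the remaining pause budget.

    candidates: list of (name, predicted_ms), assumed already sorted from
    most to least attractive. Returns (chosen names, unused budget in ms).
    """
    chosen = []
    remaining = budget_ms
    for name, predicted_ms in candidates:
        if predicted_ms > remaining:
            break  # "next region too expensive": stop selecting entirely
        chosen.append(name)
        remaining -= predicted_ms
    return chosen, remaining


# The last three picks from the log, plus the region that was rejected.
tail = [("r1", 0.345890), ("r2", 0.345963), ("r3", 0.346036), ("r4", 0.346036)]
chosen, remaining = choose_collection_set(tail, budget_ms=1.317244)
print(chosen, f"{remaining:.6f} ms left")

# A region with a very high predicted rset scan cost never fits a 20 ms goal.
print(choose_collection_set([("expensive", 668.595597)], budget_ms=20.0))
```

Under this model a region whose predicted cost alone exceeds the pause goal can never be chosen, which is case (1) from the earlier message; raising the pause goal enlarges `budget_ms` but does not change the per-region prediction.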
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
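The steady state of the drop/add loop follows from a one-line recurrence. The sketch below assumes, as a simplification, that each `dropdata` call removes exactly the given fraction of the items and each `gendata` call adds exactly `amount` items; the function names are hypothetical and this does not reproduce the real test's data structures.

```python
def steady_state(added_per_iteration, drop_ratio):
    """Fixed point of size = size * (1 - drop_ratio) + added_per_iteration."""
    return added_per_iteration / drop_ratio


def simulate(iterations, drop_ratio=0.10, added_per_iteration=25000):
    """Apply 'drop a fraction, then add a batch' repeatedly, starting empty."""
    size = 0.0
    for _ in range(iterations):
        size -= size * drop_ratio       # dropdata?ratio=0.10
        size += added_per_iteration     # gendata?amount=25000
    return size


print(steady_state(25000, 0.10))
print(round(simulate(300)))
```

Under these assumptions the map settles at about ten batches' worth of items, with one batch's worth dropped and re-added per iteration. If so, the 25000 in the message would be the per-iteration churn rather than the total map size, but that reading is a guess.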
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
--
/ Peter Schuller
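The "stuck prediction" hypothesis above can be illustrated with a toy model. This is a sketch under loud assumptions: `CostHistory` is a hypothetical moving-average predictor, not G1's real one, and the window size, the 45 ms storm cost and the 0.5 ms normal cost are invented for illustration.

```python
from collections import deque


class CostHistory:
    """Predicts the next cost as the mean of the last `window` recorded samples."""

    def __init__(self, window=10):
        self.samples = deque(maxlen=window)

    def record(self, cost_ms):
        self.samples.append(cost_ms)

    def predict(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def run(pause_goal_ms=20.0, storm_ms=45.0, normal_ms=0.5, cycles=50):
    young = CostHistory()
    non_young = CostHistory()

    # Swap storm: both histories fill up with slow "other" samples.
    for _ in range(10):
        young.record(storm_ms)
        non_young.record(storm_ms)

    for _ in range(cycles):
        # Young collections always run, so fresh samples always arrive.
        young.record(normal_ms)
        # Non-young regions are only collected if the prediction fits the goal;
        # if they are never collected, no new sample is ever recorded.
        if non_young.predict() <= pause_goal_ms:
            non_young.record(normal_ms)

    return young.predict(), non_young.predict()


young_prediction, non_young_prediction = run()
print(f"young 'other' prediction:     {young_prediction} ms")
print(f"non-young 'other' prediction: {non_young_prediction} ms")
```

In this model the young prediction recovers once the storm passes, while the non-young prediction stays at its storm-era value because the only event that could correct it is the one it prevents. Whether the real predictor can get stuck this way is exactly the open question in the message.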
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that it's either exactly the same or almost,
possibly being consistent with my proposed hypothesis that the
non-young "other" time is stuck? If the young other time is not stuck
I guess one might see some variation (I seem to get < 1 ms on my
machine) but not a lot at all in comparison to 40ms. If you're seeing
variation like 40-42 all the time, and it never decreases
significantly after it reached the 40ms range, that would be
consistent with the hypothesis I believe.
--
/ Peter Schuller
_______________________________________________
> There shouldn't be any swapping during the tests - I've got RAM fairly
> carefully allocated and I believe swappiness was tuned down on those
> machines, though I will double check to be certain.
Does HBase mmap() significant amounts of memory for I/O purposes? I'm
not very familiar with HBase and a quick Googling didn't yield an
answer.
With extensive mmap():ed I/O, excessive swapping of the application
seems to be a common problem even with significant memory margins,
sometimes even with swappiness turned down to 0. I've seen it happen
under several circumstances, and based on reports on the
cassandra-user mailing list during the past couple of months it seems
I'm not alone.
To be sure I recommend checking actual swapping history (or at least
check that the absolute amount of memory swapped out is reasonable
over time).
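One way to check swapping history on Linux is to compare the cumulative `pswpin` and `pswpout` counters from `/proc/vmstat` before and after a test run. The sketch below assumes a Linux host and works on text snapshots so it can be tried anywhere; the helper names and the sample values are made up for illustration.

```python
def swap_counters(vmstat_text):
    """Extract the cumulative pages-swapped-in/out counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swapped_between(before, after):
    """Pages swapped in and out between two snapshots."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


# Made-up snapshots; on a real machine read them with open("/proc/vmstat").read()
# once before the test starts and once after a long pause has been observed.
SNAPSHOT_BEFORE = "nr_free_pages 123456\npswpin 1000\npswpout 5000\n"
SNAPSHOT_AFTER = "nr_free_pages 120000\npswpin 1000\npswpout 5000\n"

delta = swapped_between(swap_counters(SNAPSHOT_BEFORE), swap_counters(SNAPSHOT_AFTER))
print(delta)
```

A delta of zero for both counters over the interval that contains a long pause would rule out swap traffic during that interval; it says nothing about earlier swapping.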
> I'll try to read through your full email in detail while looking at the
> source and the G1 paper -- right now it's a bit above my head :)
Well, just to re-iterate though I have really only begun looking at it
myself and my ramblings may be completely off the mark.
> FWIW, my tests on JRockit JRRT's gcprio:deterministic collector didn't go
> much better - eventually it fell back to a full compaction which lasted 45
> seconds or so. HBase must be doing something that's really hard for GCs to
> deal with - either on the heuristics front or on the allocation pattern
> front.
Interesting. I don't know a lot about JRockit's implementation since
not a lot of information seems to be available. I did my LRU
micro-benchmark with a ~20-30 GB heap and JRockit. I could definitely
press it hard enough to cause a fallback, but that seemed to be
directly as a result of high allocation rates simply exceeding the
forward progress made by the GC (based on blackbox observation
anyway).
(The other problem was that the compaction pauses were never able to
complete; it seems compaction is O(n) with respect to the number of
objects being compacted, and I was unable to make it compact less than
1% per GC (because the command line option only accepted integral
percents). With my object count, 1% was enough to exceed the pause
time requirement, so compaction was aborted every time. This would
likely give poor results over time as fragmentation becomes
significant.)
Does HBase go into periodic modes of very high allocation rate, or is
it fairly constant over time? I'm thinking that perhaps the concurrent
marking is just not triggered early enough and if large bursts of
allocations happen when the heap is relatively full, that might be the
triggering factor?
--
/ Peter Schuller
_______________________________________________
|
# 15

30-07-2010 10:58 PM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs almost look like the heap got too full,
so it must mean that incremental collection is not keeping up with
the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
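The observation that the heap looks too full at each full collection can be checked directly against the five Full GC entries Todd quoted. The sketch below is only an illustration: the regular expression is written for the log format shown in this thread (mail-wrapped lines joined back together) and is an assumption, not a general G1 log parser.

```python
import re

# The five Full GC entries quoted in the original post, with the
# mail-wrapped lines joined back together.
LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

# uptime: [Full GC <before>M-><after>M(<capacity>M), <pause> secs]
FULL_GC = re.compile(
    r"(\d+\.\d+): \[Full GC (\d+)M->(\d+)M\((\d+)M\), (\d+\.\d+) secs\]"
)


def full_gc_events(log_text):
    """Return one dict per Full GC entry found in the log text."""
    events = []
    for match in FULL_GC.finditer(log_text):
        uptime, before, after, capacity, pause = match.groups()
        events.append({
            "uptime_s": float(uptime),
            "before_mb": int(before),
            "after_mb": int(after),
            "capacity_mb": int(capacity),
            "pause_s": float(pause),
        })
    return events


def report(events):
    """Print occupancy before/after each Full GC and the gap since the last one."""
    previous = None
    for event in events:
        before_pct = 100.0 * event["before_mb"] / event["capacity_mb"]
        after_pct = 100.0 * event["after_mb"] / event["capacity_mb"]
        gap = "" if previous is None else (
            f", {(event['uptime_s'] - previous['uptime_s']) / 60:.1f} min since previous"
        )
        print(f"{before_pct:.1f}% -> {after_pct:.1f}% of heap, "
              f"pause {event['pause_s']:.1f}s{gap}")
        previous = event


report(full_gc_events(LOG))
```

On these five entries the heap is above 99% occupied before every full collection and drops back to roughly 60% afterwards, which fits both the "heap got too full" reading and Todd's stated live set of around 60% of the max heap.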
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some reasonable amount of
memory efficiency here. The thing is, the more we can get out of our
RAM for this block cache, the better our systems perform. Also,
a lot of the settings are self-tuning, so if we up the Xmx, the size of
the block cache is scaled up as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna
<> wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full gc (which,
as I understand, you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu <
> > wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full gc
issues locally for quickest progress on this?
-- ramki
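The fragmentation reading of the quoted "promotion failed" entry rests on simple arithmetic: the old gen had far more free space than the whole young gen could ever promote, yet the promotion still failed. A small sketch of that calculation, using only the numbers from the log line above (the function name is made up for illustration):

```python
def old_gen_headroom(old_used_kb, old_capacity_kb, young_capacity_kb):
    """Free space in the old gen, and how many young gens would fit in it."""
    free_kb = old_capacity_kb - old_used_kb
    return free_kb, free_kb / young_capacity_kb


# From the quoted entry: CMS 6359176K used of 8323072K, ParNew capacity 59008K.
free_kb, ratio = old_gen_headroom(6359176, 8323072, 59008)
print(f"old gen free: {free_kb} KB ({free_kb / 1024**2:.2f} GB), "
      f"about {ratio:.0f}x the young gen capacity")
```

With roughly 1.9 GB free against a 59008K young gen, a shortage of total space does not explain the failure; the free space being split into chunks too small for the objects being promoted is the explanation both Todd and ramki arrive at.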
_______________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for
slop, i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
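The sizing arithmetic in this exchange can be written down as a short sketch. It assumes megabyte units, the 8000 MB heap from Todd's -Xmx8000m, and the 64 MB young gen from the CMS flags quoted earlier; the function names and the slop factor parameter are illustrative, not anything from HBase or HotSpot.

```python
def expected_live_mb(heap_mb, memstore_pct, cache_pct):
    """Long-lived data the application will accumulate, per its own accounting."""
    return heap_mb * (memstore_pct + cache_pct) // 100


def suggested_heap_mb(stable_old_mb, young_mb, slop_factor=2):
    """Old gen sized at slop_factor times the stable occupancy, plus young gen."""
    return stable_old_mb * slop_factor + young_mb


print(expected_live_mb(8000, 40, 20))   # 40% Memstore + 20% LRU cache
print(expected_live_mb(8000, 20, 20))   # the tuned-down 20/20 split
print(suggested_heap_mb(4200, 64))      # 4.2 GB stable old gen, 64 MB young gen
```

The first two calls reproduce the 4.8 GB and 3.2 GB figures Todd gives; the third gives the whole-heap size implied by an 8.4 GB old gen.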
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of your leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition, I was recently seeing fallbacks to full GCs in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
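To make cases (1) and (2) concrete, here is a toy model of picking regions by pay-off/cost ratio under a pause budget. It is emphatically not G1's actual selection code: the greedy policy, the cost split into copy and rset scan time, and all the numbers are assumptions chosen only to show the effect.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    garbage_mb: float     # space reclaimed if the region is evacuated
    copy_ms: float        # predicted cost of copying its live data
    rset_scan_ms: float   # predicted cost of scanning its remembered set

    @property
    def cost_ms(self):
        return self.copy_ms + self.rset_scan_ms


def choose_collection_set(regions, budget_ms):
    """Greedy toy policy: best garbage-per-ms first, while the budget allows."""
    chosen, spent = [], 0.0
    ranked = sorted(regions, key=lambda r: r.garbage_mb / r.cost_ms, reverse=True)
    for region in ranked:
        if spent + region.cost_ms <= budget_ms:
            chosen.append(region.name)
            spent += region.cost_ms
    return chosen


regions = [
    Region("A", 0.9, 0.5, 25.0),  # almost all garbage, but rset scan alone > 20 ms
    Region("B", 0.9, 0.5, 12.0),  # would fit a 20 ms pause on its own
    Region("C", 0.5, 2.0, 1.0),
    Region("D", 0.4, 2.0, 1.0),
    Region("E", 0.4, 2.5, 1.0),
    Region("F", 0.3, 2.0, 1.0),
]

for budget_ms in (20, 40, 60):
    print(budget_ms, choose_collection_set(regions, budget_ms))
```

With a 20 ms budget, region A can never be chosen (case 1) and region B is always outranked by cheaper regions (case 2), even though both are mostly garbage. Raising the budget lets B in and eventually A too, which matches the mitigation described above. In a real heap the cheap regions would be replaced by new ones after each pause, which is what would keep B starved.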
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I may suspect would be
intended to help do rset scanning in the background such that regions
like these could be evicted more cheaply during the STW eviction
pause; but I didn't find such a thread anywhere in the source code -
but I may very well just be missing it.
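The two cases can be illustrated with a toy model. The Python sketch
below is not G1's actual policy code: the Region fields, the
choose_cset name, the garbage-per-millisecond ordering and all of the
numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    rs_scan_ms: float   # predicted remembered set scanning time (hypothetical)
    copy_ms: float      # predicted time to copy live objects out (hypothetical)
    garbage_kb: int     # space reclaimed if the region is evacuated

    @property
    def cost_ms(self) -> float:
        return self.rs_scan_ms + self.copy_ms

    @property
    def efficiency(self) -> float:
        # Pay-off/cost ratio: reclaimed space per predicted millisecond.
        return self.garbage_kb / self.cost_ms


def choose_cset(regions, pause_goal_ms):
    """Greedily pick regions by efficiency until the next one does not fit."""
    chosen, remaining = [], pause_goal_ms
    for region in sorted(regions, key=lambda r: r.efficiency, reverse=True):
        if region.cost_ms > remaining:
            break
        chosen.append(region.name)
        remaining -= region.cost_ms
    return chosen


# 100 cheap regions and one region that is almost all garbage but has a
# very expensive remembered set.
cheap = [Region(f"cheap-{i}", 0.3, 0.05, 1000) for i in range(100)]
costly = Region("costly", 34.0, 0.04, 1000)
candidates = cheap + [costly]

case_1 = choose_cset(candidates, pause_goal_ms=20)    # costly exceeds the whole goal
case_2 = choose_cset(candidates, pause_goal_ms=50)    # fits alone, but is crowded out
relaxed = choose_cset(candidates, pause_goal_ms=100)  # larger goal: finally selected
```

In this model the costly region is skipped under both the 20ms and the
50ms goal and is only picked once the goal is raised far enough, which
matches the "mitigated but not fixed" observation.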
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
On 07/12/10 09:02, Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
_______________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended
and sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue. Thanks,
Tony, HS GC Group
Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
_______________________________________________
Ramki/Tony,
> Any chance of setting +PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for the URLs referenced in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this e-mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time
So in the most extreme case in the excerpt above, that's more than half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive, almost entirely due to rset scanning costs.
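A quick way to see how completely the rset term dominates is to divide
it out. The Python sketch below hard-codes the six lines from the
excerpt above (joined onto single lines) and derives the predicted cost
per card and the rset share of the total. The regular expression and
the parse_predictions name are illustrative only, and the line format
is the one printed by the throw-away patch, not a standard HotSpot log
format.

```python
import re

LOG = """
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time
"""

PATTERN = re.compile(
    r"predict_region_elapsed_time_ms: ([\d.]+)ms total, ([\d.]+)ms rs scan "
    r"\((\d+) cnum\), ([\d.]+) copy time \((\d+) bytes\), ([\d.]+) other time"
)


def parse_predictions(text):
    """Return one dict per prediction line with derived ratios."""
    rows = []
    for m in PATTERN.finditer(text):
        total_ms, rs_scan_ms, cards = float(m[1]), float(m[2]), int(m[3])
        rows.append({
            "total_ms": total_ms,
            "rs_scan_ms": rs_scan_ms,
            "cards": cards,
            "ms_per_card": rs_scan_ms / cards,   # predicted cost of one card
            "rs_share": rs_scan_ms / total_ms,   # fraction of total due to rset scan
        })
    return rows


rows = parse_predictions(LOG)
for row in rows:
    print(f"{row['cards']:>7} cards  "
          f"{row['ms_per_card'] * 1000:.4f} us/card  "
          f"rset share {row['rs_share']:.1%}")
```

For these six lines the per-card figure comes out the same to within
rounding (about 0.73 microseconds per card), which suggests that the
prediction is simply linear in the card count.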
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics were accumulated by the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive; predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there are plenty of further regions that
ideally should be collected because they contain almost nothing but
garbage (ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
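The budget bookkeeping in that excerpt can be checked mechanically.
This Python sketch parses the "picked region" lines quoted above (the
format comes from the debugging patch; the names PICKED, REJECTED and
check_budget are illustrative) and confirms that each "remaining" value
is the previous one minus the previous region's predicted cost, so the
next candidate at 0.346036ms no longer fits.

```python
import re

EXCERPT = """
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(no more marked regions; next region too expensive (adaptive; predicted 0.346036ms > remaining 0.279355ms))
"""

PICKED = re.compile(r"picked region; ([\d.]+)ms predicted; ([\d.]+)ms remaining")
REJECTED = re.compile(r"predicted ([\d.]+)ms > remaining ([\d.]+)ms")


def check_budget(text, tolerance_ms=1e-6):
    """Verify that the logged budget shrinks by each picked region's cost."""
    picks = [(float(p), float(r)) for p, r in PICKED.findall(text)]
    rejected_cost, final_remaining = map(float, REJECTED.search(text).groups())
    # Budget left after charging each pick, as implied by the log lines.
    expected = [remaining - predicted for predicted, remaining in picks]
    # Budget the log reports before the following pick (or at the rejection).
    observed = [remaining for _, remaining in picks[1:]] + [final_remaining]
    consistent = all(abs(e - o) <= tolerance_ms for e, o in zip(expected, observed))
    return picks, rejected_cost, final_remaining, consistent


picks, rejected_cost, final_remaining, consistent = check_budget(EXCERPT)
```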
However, after this partial completes, it reverts back to doing just
young GCs. In other words, even though there are *plenty* of regions
with very low liveness, further partials aren't happening.
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force G1 not to fall back to doing young GCs for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly, though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ; curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test, if run as above, will essentially reach a steady state
with about 25000 pieces of data in a Clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
each iteration goes through all objects and drops 10% of them (the
ratio=0.10), after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
at more than 40ms, I wonder whether this was due to swapping.
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
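The feedback loop in this hypothesis can be mimicked with a toy
simulation. The Python sketch below is not derived from the HotSpot
sources: the moving-average predictor, the window of three samples and
all timings are assumptions chosen only to show how a predictor that
receives new samples only when a collection is allowed can get stuck,
while one that always receives samples recovers.

```python
from collections import deque


class CostHistory:
    """Moving average over the last few observed costs (an assumed predictor)."""

    def __init__(self, window=3):
        self.samples = deque(maxlen=window)

    def record(self, cost_ms):
        self.samples.append(cost_ms)

    def predict(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def simulate(observed_ms, pause_goal_ms, always_collect):
    """Return the prediction made before each cycle.

    observed_ms[i] is what the 'other' time would be in cycle i if the
    collection ran. It is only recorded when the collection actually runs.
    """
    history = CostHistory()
    predictions = []
    for actual in observed_ms:
        predicted = history.predict()
        predictions.append(predicted)
        if always_collect or predicted <= pause_goal_ms:
            history.record(actual)
    return predictions


# Two normal cycles, a three-cycle "swap storm", then normal again.
observed = [0.5, 0.5, 45.0, 45.0, 45.0] + [0.5] * 6
young = simulate(observed, pause_goal_ms=20.0, always_collect=True)
non_young = simulate(observed, pause_goal_ms=20.0, always_collect=False)

print("young     :", [round(p, 1) for p in young])
print("non-young :", [round(p, 1) for p in non_young])
```

In this model the young prediction drifts back down after the storm,
whereas the non-young prediction freezes above the pause goal because
no further samples are ever recorded.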
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that they are exactly or almost exactly the same,
which would fit my proposed hypothesis that the
non-young "other" time is stuck? If the young other time is not stuck
I guess one might see some variation (I seem to get < 1 ms on my
machine) but not a lot at all in comparison to 40ms. If you're seeing
variation like 40-42 all the time, and it never decreases
significantly after it has reached the 40ms range, that would be
consistent with the hypothesis, I believe.
--
/ Peter Schuller
_______________________________________________
> There shouldn't be any swapping during the tests - I've got RAM fairly
> carefully allocated and I believe swappiness was tuned down on those
> machines, though I will double check to be certain.
Does HBase mmap() significant amounts of memory for I/O purposes? I'm
not very familiar with HBase and a quick Googling didn't yield an
answer.
With extensive mmap()ed I/O, excessive swapping of the application
seems to be a common problem even with significant memory margins,
sometimes even with swappiness turned down to 0. I've seen it happen
under several circumstances, and based on reports on the
cassandra-user mailing list during the past couple of months it seems
I'm not alone.
To be sure, I recommend checking the actual swapping history (or at
least checking that the absolute amount of memory swapped out stays
reasonable over time).
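On Linux, one way to check is to sample the cumulative pswpin and
pswpout counters in /proc/vmstat at two points in time and compare
them. A minimal Python sketch, with sample text standing in for two
reads of that file (the helper names and the sample values are made up
for illustration):

```python
def parse_swap_counters(vmstat_text):
    """Extract the cumulative swap-in/swap-out page counters."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("pswpin", "pswpout"):
            counters[fields[0]] = int(fields[1])
    return counters


def pages_swapped_between(before_text, after_text):
    """Pages swapped in and out between two snapshots of /proc/vmstat."""
    before = parse_swap_counters(before_text)
    after = parse_swap_counters(after_text)
    return {name: after[name] - before[name] for name in ("pswpin", "pswpout")}


# Hypothetical snapshots; real ones would be read from /proc/vmstat.
SAMPLE_BEFORE = "pgpgin 1000\npswpin 120\npswpout 340\n"
SAMPLE_AFTER = "pgpgin 1500\npswpin 120\npswpout 990\n"

delta = pages_swapped_between(SAMPLE_BEFORE, SAMPLE_AFTER)
print(delta)
```

On a real machine the two texts would come from reading /proc/vmstat
before and after the interval of interest; non-zero deltas mean pages
were swapped during that interval.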
> I'll try to read through your full email in detail while looking at the
> source and the G1 paper -- right now it's a bit above my head :)
Well, just to reiterate: I have really only begun looking at it
myself and my ramblings may be completely off the mark.
> FWIW, my tests on JRockit JRRT's gcprio:deterministic collector didn't go
> much better - eventually it fell back to a full compaction which lasted 45
> seconds or so. HBase must be doing something that's really hard for GCs to
> deal with - either on the heuristics front or on the allocation pattern
> front.
Interesting. I don't know a lot about JRockit's implementation since
not a lot of information seems to be available. I did my LRU
micro-benchmark with a ~20-30 GB heap and JRockit. I could definitely
press it hard enough to cause a fallback, but that seemed to be
directly as a result of high allocation rates simply exceeding the
forward progress made by the GC (based on blackbox observation
anyway).
(The other problem was that the compaction pauses were never able to
complete; it seems compaction is O(n) with respect to the number of
objects being compacted, and I was unable to make it compact less than
1% per GC (because the command line option only accepted integral
percents), and with my object count the 1% was enough to hit the pause
time requirement so compaction was aborted every time. Likely this
would have poor results over time as fragmentation becomes
significant.)
Does HBase go into periodic modes of very high allocation rate, or is
it fairly constant over time? I'm thinking that perhaps the concurrent
marking is just not triggered early enough and if large bursts of
allocations happen when the heap is relatively full, that might be the
triggering factor?
--
/ Peter Schuller
_______________________________________________
> Yep, I've seen JRRT also "abort compaction" on most compactions. I couldn't
> quite figure out how to tell it that it was fine to pause more often for
> compaction, so long as each pause was short.
FWIW, I got the impression at the time (but I don't remember why; I
think I was half-guessing based on assumptions about what it does and
several iterations through the documentation) that it was
fundamentally only *able* to do compaction during the stop-the-world
pause after a concurrent mark phase. I.e., I don't think you can make
it spread the work out (but I can most definitely be wrong).
--
/ Peter Schuller
_______________________________________________
# 16

22-01-2011 12:51 AM
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs, all the better, as a known
starting point. The full GCs almost look like the heap got too full,
so it must mean that incremental collection is not keeping up with
the rate of garbage generation.
Also, what's the JDK build you are running?
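One way to put a number on "not keeping up" is to look at how fast the
heap refills between the full collections. A rough Python sketch, using
the five Full GC lines from Todd's message joined onto single lines
(the regular expression and the refill_rates name are illustrative, and
the result is only the net growth in heap occupancy, not the true
allocation rate):

```python
import re

FULL_GC_LOG = """
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

# Captures: uptime in seconds, heap before, heap after, capacity, pause length.
FULL_GC = re.compile(
    r"(\d+\.\d+): \[Full GC (\d+)M->(\d+)M\((\d+)M\), ([\d.]+) secs\]"
)


def refill_rates(text):
    """Net heap growth in MB/s between consecutive full collections."""
    events = [
        (float(uptime), int(before), int(after))
        for uptime, before, after, _capacity, _secs in FULL_GC.findall(text)
    ]
    rates = []
    for (t0, _, after0), (t1, before1, _) in zip(events, events[1:]):
        # Occupancy left by one full GC versus occupancy at the next one.
        rates.append((before1 - after0) / (t1 - t0))
    return rates


rates = refill_rates(FULL_GC_LOG)
print([round(r, 1) for r in rates])
```

For this log the net refill rate works out to roughly 5 to 10 MB/s,
that is, about 3GB of uncollected garbage accumulating every five to
nine minutes.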
-- ramki
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here... The thing is, the more we can get out of our
RAM for this block cache, the better our systems perform. Also,
a lot of the settings are self-tuning, so if we up the Xmx, the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
>
>> ------------------------------------------------------------------------
Two questions:
(1) Do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) Does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand it, you get to more quickly with G1 than you did
with CMS).
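To make the "how much does each full collection reclaim" question concrete, the Full GC lines quoted earlier in the thread can be summarized with a few lines of Python. This is only a sketch: the regular expression assumes the exact `[Full GC ...M->...M(...M), ... secs]` layout seen in Todd's excerpt, and the helper names (`parse_full_gcs`, `summarize`) are made up for illustration.

```python
import re

# Sample lines copied from the Full GC excerpt quoted earlier in this thread.
LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""

# Assumed layout: "<uptime>: [Full GC <before>M-><after>M(<capacity>M), <secs> secs]"
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<capacity>\d+)M\), (?P<secs>\d+\.\d+) secs\]"
)


def parse_full_gcs(text):
    """Return one dict per Full GC event found in the log text."""
    events = []
    for m in FULL_GC.finditer(text):
        events.append({
            "uptime": float(m.group("uptime")),
            "before": int(m.group("before")),
            "after": int(m.group("after")),
            "capacity": int(m.group("capacity")),
            "secs": float(m.group("secs")),
        })
    return events


def summarize(events):
    """Compute reclaimed space, post-GC occupancy and time since the previous Full GC."""
    rows = []
    prev = None
    for e in events:
        rows.append({
            "reclaimed_mb": e["before"] - e["after"],
            "live_pct": 100.0 * e["after"] / e["capacity"],
            "gap_secs": None if prev is None else e["uptime"] - prev["uptime"],
            "pause_secs": e["secs"],
        })
        prev = e
    return rows


for row in summarize(parse_full_gcs(LOG)):
    print(row)
```

On these five lines the post-collection occupancy comes out at a little over 60% of the 8000M heap each time, which matches the live set Todd describes.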
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
>
>> ------------------------------------------------------------------------
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not include the
tail of the run leading into the concurrent mode failure you display
above (it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, -XX:+PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
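The fragmentation reading rests on how much old gen was nominally free when the promotion failed. As a rough check, the old-gen figures can be pulled out of the quoted log line like this. It is a sketch only: the pattern assumes the `(concurrent mode failure): <before>K-><after>K(<capacity>K), <secs> secs` layout shown above, and `old_gen_at_failure` is a hypothetical helper name.

```python
import re

# The relevant part of the log line quoted above, joined onto one line.
LINE = (
    "(concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220 secs] "
    "6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)], "
    "17.4546890 secs]"
)

PATTERN = re.compile(
    r"\(concurrent mode failure\): (\d+)K->(\d+)K\((\d+)K\), (\d+\.\d+) secs"
)


def old_gen_at_failure(line):
    """Extract old-gen occupancy around a CMS concurrent mode failure."""
    m = PATTERN.search(line)
    if m is None:
        return None
    before_kb, after_kb, capacity_kb = (int(g) for g in m.groups()[:3])
    return {
        # Space that was nominally free in the old gen when promotion failed.
        "free_before_mb": (capacity_kb - before_kb) / 1024.0,
        "occupancy_before_pct": 100.0 * before_kb / capacity_kb,
        "occupancy_after_pct": 100.0 * after_kb / capacity_kb,
        "pause_secs": float(m.group(4)),
    }


print(old_gen_at_failure(LINE))
```

With roughly 1.9 GB nominally free and a young gen of only 64M, a promotion failure is hard to explain by raw capacity alone, which is consistent with the fragmentation interpretation.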
I assume you don't have an easily shared test case that would let us
reproduce both the CMS fragmentation and the G1 full GC
issues locally, for quickest progress on this?
-- ramki
_______________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for
slop, i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
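The sizing rule above is simple arithmetic, sketched here for convenience. The assumptions are mine: a slop factor of 2 (old gen twice the stable occupancy), a 64 MB young gen as in the CMS flags quoted earlier in the thread, and 4.2 GB rounded to 4300 MB. `suggest_heap` is a hypothetical helper, not part of any tool.

```python
def suggest_heap(stable_old_gen_mb, young_gen_mb, slop_factor=2.0):
    """Size the old gen as slop_factor times its stable occupancy, then add the young gen."""
    old_gen_mb = int(round(stable_old_gen_mb * slop_factor))
    total_mb = old_gen_mb + young_gen_mb
    flags = [
        "-Xmx%dm" % total_mb,
        "-XX:NewSize=%dm" % young_gen_mb,
        "-XX:MaxNewSize=%dm" % young_gen_mb,
    ]
    return old_gen_mb, total_mb, flags


# Example: 4.2 GB stable old gen (about 4300 MB) with a 64 MB young gen.
old_gen, total, flags = suggest_heap(4300, 64)
print(old_gen, total, " ".join(flags))
```

From a configuration sized this way, the slop factor can be lowered step by step until the full collections reappear.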
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of your leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able to completely avoid the creeping fragmentation, if possible.
> There are other ways to tune for this, but they are more
> labour-intensive and tricky, and I would not want to go into that
> lightly. You might want to contact your Java support for help with
> that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition, I was recently seeing fallbacks to full GCs in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and was, in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
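The two cases can be illustrated with a toy model of budgeted region selection. This is emphatically not HotSpot's actual collection set code: the `Region` fields, the "sort by garbage per predicted millisecond, stop at the first candidate that does not fit" policy, and all numbers are assumptions chosen to show the starvation effect.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    garbage_kb: int       # space reclaimed if the region is evacuated
    copy_ms: float        # predicted cost of copying its live objects
    rset_scan_ms: float   # predicted cost of scanning its remembered set

    @property
    def predicted_ms(self):
        return self.copy_ms + self.rset_scan_ms

    @property
    def efficiency(self):
        # Pay-off/cost ratio: garbage reclaimed per predicted millisecond.
        return self.garbage_kb / self.predicted_ms


def choose_collection_set(regions, budget_ms):
    """Greedily pick the most efficient regions until the next one does not fit."""
    chosen, remaining = [], budget_ms
    for r in sorted(regions, key=lambda r: r.efficiency, reverse=True):
        if r.predicted_ms > remaining:
            break  # next candidate is too expensive for what is left of the budget
        chosen.append(r)
        remaining -= r.predicted_ms
    return chosen, remaining


regions = [
    Region("a", 1000, 0.3, 0.05),
    Region("b", 1000, 0.3, 0.05),
    Region("medium", 1000, 0.1, 15.0),   # case (2): fits alone, but always outranked
    Region("hot", 1000, 0.1, 668.0),     # case (1): can never fit in the budget
]

chosen, remaining = choose_collection_set(regions, budget_ms=15.5)
print([r.name for r in chosen], remaining)
```

In this model, "hot" exceeds any realistic pause budget on its own, while "medium" would fit by itself but is starved as long as cheaper regions keep appearing ahead of it. Raising `budget_ms` rescues "medium" but not "hot", which mirrors the "mitigated, not fixed" observation.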
It is unclear to me how this is intended to be handled. The original
G1 paper mentions an rset scanning thread that I suspect is
intended to do rset scanning in the background so that regions
like these could be evicted more cheaply during the STW eviction
pause. I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
_______________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
_______________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue. Thanks,
Tony, HS GC Group
_______________________________________________
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for the URLs referenced in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this email, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time
So in the most extreme case in the excerpt above, that's more than half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive, almost entirely due to rset scanning costs.
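To quantify how much of each prediction is rset scanning, the patch output can be parsed mechanically. The sketch below assumes the line layout shown in the excerpt (which comes from the throw-away patch, not from a stock JVM), and it collapses whitespace first so that email line wrapping does not matter. `parse_predictions` is a hypothetical helper name.

```python
import re

# Three records from the excerpt above, left wrapped as they appeared in the mail.
SAMPLE = """\
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time
"""

PRED = re.compile(
    r"predict_region_elapsed_time_ms: (?P<total>[\d.]+)ms total, "
    r"(?P<rs>[\d.]+)ms rs scan \((?P<cards>\d+) cnum\), "
    r"(?P<copy>[\d.]+) copy time \((?P<bytes>\d+) bytes\), "
    r"(?P<other>[\d.]+) other time"
)


def parse_predictions(text):
    """Parse patch output into per-region cost records."""
    flat = " ".join(text.split())  # undo line wrapping
    records = []
    for m in PRED.finditer(flat):
        total = float(m.group("total"))
        rs = float(m.group("rs"))
        cards = int(m.group("cards"))
        records.append({
            "total_ms": total,
            "rs_scan_ms": rs,
            "cards": cards,
            "rs_share": rs / total,                # fraction of the estimate spent on rset scan
            "us_per_card": 1000.0 * rs / cards,    # predicted microseconds per card
        })
    return records


for rec in parse_predictions(SAMPLE):
    print(rec)
```

On these records the rset scan accounts for about 99% or more of each estimate, and the predicted cost works out to roughly 0.73 microseconds per card in every case, so the estimate appears to scale linearly with the card count.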
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics were accumulated by the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive; predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low-hanging fruit picked still had
liveness at 1%; in other words, there are plenty of further regions that
ideally should be collected because they contain almost nothing but
garbage (ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back, and we
simply exceeded the allotted time for this particular GC.
However, after this partial completes, it reverts back to doing just
young GCs. In other words, even though there are *plenty* of regions
with very low liveness, further partials aren't happening.
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force G1 to not fall back to doing young GCs for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly, though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
reproduce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to be > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
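The feedback loop in the hypothesis above can be sketched as a toy model. Everything here is an assumption made for illustration: the `CostHistory` class, the plain windowed mean used as the predictor, the 20 ms goal and the single 200 ms outlier standing in for a swap-affected pause. G1's real predictor is more elaborate.

```python
from collections import deque


class CostHistory:
    """Hypothetical cost predictor: mean of the most recent samples."""

    def __init__(self, maxlen=10):
        self.samples = deque(maxlen=maxlen)

    def add(self, ms):
        self.samples.append(ms)

    def predict(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def run(other_costs_ms, pause_goal_ms):
    young, non_young = CostHistory(), CostHistory()
    non_young_collected = []
    for cost_ms in other_costs_ms:
        # Young regions are collected in every pause, so this history
        # always receives fresh samples and can recover from an outlier.
        young.add(cost_ms)
        # Non-young regions are only picked when the prediction fits the
        # goal; a skipped pause records no sample, so nothing corrects it.
        if non_young.predict() <= pause_goal_ms:
            non_young.add(cost_ms)
            non_young_collected.append(True)
        else:
            non_young_collected.append(False)
    return young.predict(), non_young.predict(), non_young_collected


# Three normal pauses, one slow outlier, then twenty normal pauses.
costs = [0.3] * 3 + [200.0] + [0.3] * 20
young_pred, non_young_pred, collected = run(costs, pause_goal_ms=20.0)
print(young_pred, non_young_pred, collected.count(True))
```

In this model the young prediction returns to roughly 0.3 ms while the non-young prediction stays above the goal indefinitely, which is the asymmetry the bullets describe.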
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that they are exactly or almost exactly the same,
which would fit my proposed hypothesis that the non-young "other" time
is stuck? If the young other time is not stuck
I guess one might see some variation (I seem to get < 1 ms on my
machine) but not a lot at all in comparison to 40ms. If you're seeing
variation like 40-42 all the time, and it never decreases
significantly after reaching the 40ms range, that would be
consistent with the hypothesis I believe.
--
/ Peter Schuller
_______________________________________________
> There shouldn't be any swapping during the tests - I've got RAM fairly
> carefully allocated and I believe swappiness was tuned down on those
> machines, though I will double check to be certain.
Does HBase mmap() significant amounts of memory for I/O purposes? I'm
not very familiar with HBase and a quick Googling didn't yield an
answer.
With extensive mmap():ed I/O, excessive swapping of the application
seems to be a common problem even with significant memory margins,
sometimes even with swappiness turned down to 0. I've seen it happen
under several circumstances, and based on reports on the
cassandra-user mailing list during the past couple of months it seems
I'm not alone.
To be sure I recommend checking actual swapping history (or at least
check that the absolute amount of memory swapped out is reasonable
over time).
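One way to check swapping history on Linux is to compare the cumulative `pswpin` and `pswpout` counters in `/proc/vmstat` at two points in time. The sketch below is only an illustration: the helper names are made up, and the sample text uses invented numbers in place of a real `/proc/vmstat` read.

```python
def parse_swap_counters(vmstat_text):
    """Extract cumulative pages swapped in/out from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after):
    """Pages swapped in and out between two snapshots."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


# Invented sample snapshots; on a real machine, read the text with
# open("/proc/vmstat").read() before and after the interval of interest.
sample_before = "nr_free_pages 1000\npswpin 120\npswpout 340\n"
sample_after = "nr_free_pages 900\npswpin 120\npswpout 352\n"

delta = swap_activity(
    parse_swap_counters(sample_before),
    parse_swap_counters(sample_after),
)
print(delta)
```

A non-zero delta over an interval that includes a slow GC pause would be worth correlating with the GC log timestamps.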
> I'll try to read through your full email in detail while looking at the
> source and the G1 paper -- right now it's a bit above my head :)
Well, just to re-iterate though I have really only begun looking at it
myself and my ramblings may be completely off the mark.
> FWIW, my tests on JRockit JRRT's gcprio:deterministic collector didn't go
> much better - eventually it fell back to a full compaction which lasted 45
> seconds or so. HBase must be doing something that's really hard for GCs to
> deal with - either on the heuristics front or on the allocation pattern
> front.
Interesting. I don't know a lot about JRockit's implementation since
not a lot of information seems to be available. I did my LRU
micro-benchmark with a ~20-30 GB heap and JRockit. I could definitely
press it hard enough to cause a fallback, but that seemed to be
directly as a result of high allocation rates simply exceeding the
forward progress made by the GC (based on blackbox observation
anyway).
(The other problem was that the compaction pauses were never able to
complete; it seems compaction is O(n) with respect to the number of
objects being compacted, and I was unable to make it compact less than
1% per GC (because the command line option only accepted integral
percents), and with my object count the 1% was enough to hit the pause
time requirement so compaction was aborted every time. Likely this
would have poor results over time as fragmentation becomes
significant.)
Does HBase go into periodic modes of very high allocation rate, or is
it fairly constant over time? I'm thinking that perhaps the concurrent
marking is just not triggered early enough and if large bursts of
allocations happen when the heap is relatively full, that might be the
triggering factor?
--
/ Peter Schuller
_______________________________________________
> Yep, I've seen JRRT also "abort compaction" on most compactions. I couldn't
> quite figure out how to tell it that it was fine to pause more often for
> compaction, so long as each pause was short.
FWIW, I got the impression at the time (but I don't remember why; I
think I was half-guessing based on assumptions about what it does and
several iterations through the documentation) that it was
fundamentally only *able* to do compaction during the stop-the-world
pause after a concurrent mark phase. I.e., I don't think you can make
it spread the work out (but I can most definitely be wrong).
--
/ Peter Schuller
_______________________________________________
A bit more data. I did the following patch:
@@ -1560,6 +1575,19 @@
       _non_young_other_cost_per_region_ms_seq->add(non_young_other_time_ms /
                                          (double) _recorded_non_young_regions);
+    } else {
+      // no non-young gen collections - if our prediction is high enough, we would
+      // never collect non-young again, so push it back towards zero so we give it
+      // another try.
+      double predicted_other_time = predict_non_young_other_time_ms(1);
+      if (predicted_other_time > MaxGCPauseMillis/2.0) {
+        if (G1PolicyVerbose > 0) {
+          gclog_or_tty->print_cr(
+            "Predicted non-young other time %.1f is too large compared to max pause time. Weighting down.",
+            predicted_other_time);
+        }
+        _non_young_other_cost_per_region_ms_seq->add(0.0);
+      }
     }
and this mostly solved the problem described above. Now I get a full GC
every 45-50 minutes which is way improved from what it was before.
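To get a feel for how quickly this kind of weighting down recovers, here is a small sketch. It assumes the prediction is a plain arithmetic mean over all recorded samples, which is a simplification of whatever the real sequence class computes. The function name and the sample history are hypothetical; the 20 ms and 50 ms goals are the two pause targets mentioned in this thread.

```python
def pauses_until_recovered(history_ms, max_gc_pause_ms):
    """Count the 0.0 samples needed before the mean drops to goal/2 or below.

    Models one 0.0 sample per pause in which no non-young regions were
    collected, as in the else branch of the patch.
    """
    history = list(history_ms)
    if not history:
        return 0
    pauses = 0
    while sum(history) / len(history) > max_gc_pause_ms / 2.0:
        history.append(0.0)  # stands in for add(0.0) on the cost sequence
        pauses += 1
    return pauses


# Hypothetical history: three cheap samples and one 200 ms outlier.
history = [0.3, 0.3, 0.3, 200.0]
print(pauses_until_recovered(history, 50.0))
print(pauses_until_recovered(history, 20.0))
```

Under these assumptions the prediction falls below the threshold after 5 pauses with a 50 ms goal and after 17 pauses with a 20 ms goal, so a tighter goal means a longer stretch without non-young collection after a single outlier.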
I still seem to be putting off GC of non-young regions too much though. I
did some analysis of the G1 log and made these graphs:
http://people.apache.org/~todd/hbase-fragmentation/g1-graphing.png
The top graph is a heat map of the number of young (pink) and
non-young (blue) regions in each collection.
The middle graph is the post-collection heap usage over time, in MB.
The bottom graph is a heat map and smoothed line graph of the number of
milliseconds spent per collection. The target in this case is 50ms.
A few interesting things:
- not sure what causes the sort of periodic striated pattern in the number
of young generation regions chosen
- most of the time no old gen regions are selected for collection at all!
Here's a graph of just old regions:
http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
- When old regions are actually selected for collection the heap usage does
drop, though elapsed time does spike over the guarantee.
So it seems like something about the heuristics isn't quite right. Thoughts?
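For anyone wanting to do a similar analysis, a rough sketch of pulling Full GC events out of a log is shown below. It assumes the `[Full GC ...M->...M(...M), ... secs]` line format quoted elsewhere in this thread, including entries wrapped across two lines. The regular expression and function name are illustrative and are not the scripts used to produce the graphs above.

```python
import re

# Matches e.g. "52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]",
# allowing a line break between the comma and the pause time.
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<total>\d+)M\),\s+(?P<secs>\d+\.\d+) secs\]"
)


def full_gcs(log_text):
    """Return (uptime_s, before_mb, after_mb, pause_s) for each Full GC."""
    return [
        (float(m["uptime"]), int(m["before"]), int(m["after"]), float(m["secs"]))
        for m in FULL_GC.finditer(log_text)
    ]


# Two entries copied from the log excerpt quoted in this thread.
sample = (
    "2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),\n"
    " 9.8907800 secs]\n"
    "2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),\n"
    " 9.9025520 secs]\n"
)
events = full_gcs(sample)
intervals = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
print(events, intervals)
```

The intervals between consecutive events give the time between full collections, which makes it easy to compare runs before and after a change such as the patch above.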
-Todd
On Fri, Jan 21, 2011 at 11:38 AM, Todd Lipcon <> wrote:
> Hey folks,
>
> Took some time over the last day or two to follow up on this on the latest
> checkout of JDK7. I added some more instrumentation and my findings so far
> are:
>
> 1) CMS is definitely hitting a fragmentation problem. Our workload is
> pretty much guaranteed to fragment, and I don't think there's anything CMS
> can do about it - see the following graphs:
> http://people.apache.org/~todd/hbase-fragmentation/
>
> 2) G1GC is hitting
> full pauses because the "other" pause time predictions end up higher than
> the minimum pause length. I'm seeing the following sequence:
>
> - A single choose_cset operation for a non_young region takes a long time
> (unclear yet why this is happening, see below)
> - This inflates the predict_non_young_other_time_ms(1) result to a value
> greater than my pause goal
> - From then on, it doesn't collect any more non-young regions (ever!)
> because any region will be considered expensive regardless of the estimated
> rset or collection costs
> - The heap slowly fills up with non-young regions until we reach a full GC
>
> 3) So the question is why the choose_cset is taking a long time. I added
> getrusage() calls to wrap the choose_cset operation. Here's some output with
> extra logging:
>
> --> About to choose cset at 725.458
> Adding 1 young regions to the CSet
> Added 0x0000000000000001 Young Regions to CS.
> (3596288 KB left in heap.)
> (picked region; 9.948053ms predicted; 21.164738ms remaining; 2448kb
> marked; 2448kb maxlive; 59-59% liveness)
> (3593839 KB left in heap.)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (picked region; 10.493828ms predicted; 11.216685ms remaining; 2279kb
> marked; 2279kb maxlive; 55-55% liveness)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (3591560 KB left in heap.)
> predict_region_elapsed_time_ms: 10.346346ms total, 5.119780ms rs scan
> (13558 cnum), 5.046912 copy time (2439568 bytes), 0.179654 other time
> predict_region_elapsed_time_ms: 10.407672ms total, 5.333135ms rs scan
> (14123 cnum), 4.894882 copy time (2366080 bytes), 0.179654 other time
> (no more marked regions; next region too expensive (adaptive; predicted
> 10.407672ms > remaining 0.722857ms))
> Resource usage of choose_cset:majflt: 0 nswap: 0 nvcsw: 6 nivcsw: 0
> --> About to prepare RS scan at 725.657
>
> The resource usage line with nvcsw=6 indicates there were 6 voluntary
> context switches while choosing cset. This choose_cset operation took
> 198.9ms all in choosing non-young.
>
> So, why are there voluntary context switches while choosing cset? This
> isn't swapping -- that should show under majflt, right? My only theories
> are:
> - are any locks acquired in choose_cset?
> - maybe the gc logging itself is blocking on IO to the log file? ie the
> instrumentation itself is interfering with the algorithm?
>
>
> Regardless, I think a single lengthy choose_non_young_cset operation
> shouldn't be allowed to push the prediction above the time boundary and
> trigger this issue. Perhaps a simple workaround is that, whenever a
> collection chooses no non_young regions, it should contribute a value of 0
> to the average?
>
> I'll give this heuristic a try on my build and see if it solves the issue.
>
> -Todd
>
> On Tue, Jul 27, 2010 at 3:08 PM, Todd Lipcon <> wrote:
>
>> Hi all,
>>
>> Back from my vacation and took some time yesterday and today to build a
>> fresh JDK 7 with some additional debug printouts from Peter's patch.
>>
>> What I found was a bit different - the rset scanning estimates are low,
>> but I consistently am seeing "Other time" estimates in the >40ms range.
>> Given my pause time goal of 20ms, these estimates are I think excluding most
>> of the regions from collectability. I haven't been able to dig around yet to
>> figure out where the long estimate for "other time" is coming from - in the
>> collections logged it sometimes shows fairly high "Other" but the "Choose
>> CSet" component is very short. I'll try to add some more debug info to the
>> verbose logging and rerun some tests over the next couple of days.
>>
>> At the moment I'm giving the JRockit VM a try to see how its deterministic
>> GC stacks up against G1 and CMS.
>>
>> Thanks
>> -Todd
>>
>>
>> On Tue, Jul 13, 2010 at 5:15 PM, Peter Schuller <
>> > wrote:
>>
>>> Ramki/Tony,
>>>
>>> > Any chance of setting +PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
>>> > sending us the log, or part of it (say between two Full GCs)? Be
>>> prepared:
>>> > this will generate piles of output. But it will give us per-region
>>> > information that might shed more light on the cause of the issue....
>>> thanks,
>>>
>>> So what I have in terms of data is (see footnotes for urls references in
>>> []):
>>>
>>> (a) A patch[1] that prints some additional information about estimated
>>> costs of region eviction, and disables the GC efficiency check that
>>> normally terminates selection of regions. (Note: This is a throw-away
>>> patch for debugging; it's not intended as a suggested change for
>>> inclusion.)
>>>
>>> (b) A log[2] showing the output of a test run I did just now, with
>>> both your flags above and my patch enabled (but without disabling the
>>> efficiency check). It shows fallback to full GC when the actual live
>>> set size is 252 MB, and the maximum heap size is 2 GB (in other words,
>>> ~ 12% liveness). An easy way to find the point of full gc is to search
>>> for the string 'full 1'.
>>>
>>> (c) A file[3] with the effective VM options during the test.
>>>
>>> (d) Instructions for how to run the test to reproduce it (I'll get to
>>> that at the end; it's simplified relative to previously).
>>>
>>> (e) Nature of the test.
>>>
>>> Discussion:
>>>
>>> With respect to region information: I originally tried it in response
>>> to your recommendation earlier, but I found I did not see the
>>> information I was after. Perhaps I was just misreading it, but I
>>> mostly just saw either 0% or 100% fullness, and never the actual
>>> liveness estimate as produced by the mark phase. In the log I am
>>> referring to in this E-Mail, you can see that the last printout of
>>> region information just before the live GC fits this pattern; I just
>>> don't see anything that looks like legitimate liveness information
>>> being printed. (I don't have time to dig back into it right now to
>>> double-check what it's printing.)
>>>
>>> If you scroll up from the point of the full gc until you find a bunch
>>> of output starting with "predict_region_elapsed_time_ms" you see some
>>> output resulting from the patch, with pretty extreme values such as:
>>>
>>> predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
>>> (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
>>> (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
>>> (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
>>> scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
>>> time
>>> predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
>>> scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
>>> time
>>> predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
>>> scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
>>> time
>>>
>>> So in the most extreme case in the excerpt above, that's > half a
>>> second of estimated rset scanning time for a single region with 914147
>>> cards to be scanned. While not all are that extreme, lots and lots of
>>> regions are very expensive, almost entirely due to rset scanning costs.
>>>
>>> If you scroll down a bit to the first (and ONLY) partial that happened
>>> after the statistics were accumulated by the marking phase, we see more
>>> output resulting from the patch. At the end, we see:
>>>
>>> (picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>> (393380 KB left in heap.)
>>> (picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>> (393365 KB left in heap.)
>>> (picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>> (393349 KB left in heap.)
>>> (no more marked regions; next region too expensive (adaptive;
>>> predicted 0.346036ms > remaining 0.279355ms))
>>>
>>> So in other words, it picked a bunch of regions in order of "lowest
>>> hanging fruit". The *least* low hanging fruit picked still had
>>> liveness at 1%; in other words, there are plenty of further regions that
>>> ideally should be collected because they contain almost no live data
>>> (ignoring the cost of collecting them).
>>>
>>> In this case, it stopped picking regions because the next region to be
>>> picked, though cheap, was the straw that broke the camel's back and we
>>> would simply have exceeded the allotted time for this particular GC.
>>>
>>> However, after this partial completes, it reverts back to doing just
>>> young gc:s. In other words, even though there's *plenty* of regions
>>> with very low liveness, further partials aren't happening.
>>>
>>> By applying this part of the patch:
>>>
>>> - (adaptive_young_list_length() &&
>>> + (adaptive_young_list_length() && false && // scodetodo
>>>
>>> I artificially force g1 to not fall back to doing young gc:s for
>>> efficiency reasons. When I run with that change, I don't experience
>>> the slow perpetual growth until fallback to full GC. If I remember
>>> correctly though, the rset scanning cost is in fact high, but I don't
>>> have details saved and I'm afraid I don't have time to re-run those
>>> tests right now and compare numbers.
>>>
>>> Reproducing it:
>>>
>>> I made some changes and the test case should now hopefully be easy to
>>> run assuming you have maven installed. The github project is at:
>>>
>>> http://github.com/scode/httpgctest
>>>
>>> There is a README, but the shortest possible instructions to
>>> reproduce the test that I did:
>>>
>>> git clone git://github.com/scode/httpgctest.git
>>> cd httpgctest
>>> git checkout 20100714_1 # grab from appropriate tag, in case I change master
>>> mvn package
>>> HTTPGCTEST_LOGGC=gc.log ./run.sh
>>>
>>> That should start the http server; then run concurrently:
>>>
>>> while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
>>> curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
>>>
>>> And then just wait and observe.
>>>
>>> Nature of the test:
>>>
>>> So the test if run as above will essentially reach a steady state of
>>> equilibrium with about 25000 pieces of data in a clojure immutable
>>> map. The result is that a significant amount of new data is being
>>> allocated, but very little writing to old regions is happening. The
>>> garbage generated is very well spread out over the entire heap because
>>> it goes through all objects and drops 10% (the ratio=0.10) for each
>>> iteration, after which it adds 25000 new items.
>>>
>>> In other words; not a lot of old gen writing, but lots of writes to
>>> the young gen referencing objects in the old gen.
>>>
>>> [1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>>> [2]
>>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
>>> [3]
>>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
>>>
>>> --
>>> / Peter Schuller
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
--
Todd Lipcon
Software Engineer, Cloudera
|
# 17

23-01-2011 08:21 AM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run into
> full heap compaction which takes several minutes and wreaks havoc on the
> system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a fair
> amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000
> secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5
> 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60% of
> the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out our
> main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here... The thing is the more we can get out of our
ram for this block cache, the better performing our systems are. Also
a lot of the settings are self tuning, so if we up the Xmx the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, if you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If i am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full gc (which
as i understand you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I assume you don't have an easily shared test case that would let us
reproduce both the CMS fragmentation and the G1 full gc
issues locally, for quickest progress on this?
-- ramki
_______________________________________________
___________________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, i see. In that case, i think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much for slop.
i.e. make the old gen 8.4 GB (or whatever is the measured stable
old gen occupancy), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as i said earlier perhaps
double the old gen from its present size. If that doesn't work
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
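The sizing rule suggested above (old gen at twice its measured stable occupancy, plus the young gen) can be written out as a small calculation. This is only a sketch: the class and method names are hypothetical, 1 GB is taken as 1000 MB for round numbers, and the 64 MB young gen is borrowed from the CMS flags quoted earlier in the thread.

```java
// Hypothetical helper for illustration; not part of HBase or the JVM.
final class HeapSizingSketch {

    /** Old gen sized at the stable occupancy plus an equal amount of slop. */
    static long oldGenMb(long stableOldGenMb) {
        return 2 * stableOldGenMb;
    }

    /** Whole heap = doubled old gen + young gen. */
    static long totalHeapMb(long stableOldGenMb, long youngGenMb) {
        return oldGenMb(stableOldGenMb) + youngGenMb;
    }

    public static void main(String[] args) {
        long stableOldGenMb = 4200; // ~4.2 GB stable old gen occupancy (assumed 1 GB = 1000 MB)
        long youngGenMb = 64;       // NewSize/MaxNewSize from the CMS flags quoted earlier
        long totalMb = totalHeapMb(stableOldGenMb, youngGenMb);
        // Print a candidate starting point to iterate downwards from.
        System.out.println("-Xmx" + totalMb + "m -XX:NewSize=" + youngGenMb
                + "m -XX:MaxNewSize=" + youngGenMb + "m");
    }
}
```

The result is a deliberately generous starting configuration; the intent in the text is to confirm it avoids the full collections and then shrink the heap step by step.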
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of yr leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
___________________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GC:s recently in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect would be
intended to help do rset scanning in the background such that regions
like these could be evicted more cheaply during the STW eviction
pause; but I didn't find such a thread anywhere in the source code -
but I may very well just be missing it.
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
_______________________________________________
___________________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue.... thanks,
Tony, HS GC Group
_______________________________________________
___________________________________________________
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for URL references in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
(61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
(21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time
So in the most extreme case in the excerpt above, that's > half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive and almost only due to rset scanning costs.
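To make the scale concrete, the per-card cost implied by these lines can be computed directly from the patch output. The sketch below is not part of the patch; the class name and regular expression are assumptions that only match the "rs scan (N cnum)" format shown in the excerpt, with each log entry joined back onto a single line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper for illustration; matches only the patch output format quoted above.
final class RsetCostSketch {

    // Captures the predicted rs scan time in ms and the card count ("cnum").
    private static final Pattern RS_SCAN = Pattern.compile(
            "([0-9.]+)ms\\s+rs\\s+scan\\s+\\((\\d+)\\s+cnum\\)");

    /** Predicted rset scanning cost per card, in milliseconds. */
    static double msPerCard(String logLine) {
        Matcher m = RS_SCAN.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no rs scan figures in: " + logLine);
        }
        double rsScanMs = Double.parseDouble(m.group(1));
        long cards = Long.parseLong(m.group(2));
        return rsScanMs / cards;
    }

    public static void main(String[] args) {
        String cheapest = "predict_region_elapsed_time_ms: 16.250033ms total, "
                + "15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), "
                + "0.317419 other time";
        String priciest = "predict_region_elapsed_time_ms: 668.595597ms total, "
                + "668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), "
                + "0.317419 other time";
        System.out.println(msPerCard(cheapest));
        System.out.println(msPerCard(priciest));
    }
}
```

Both lines work out to roughly 0.00073 ms per card, which suggests the prediction is linear in the card count: the expensive regions are expensive because of how many cards they have, not because of the amount of live data to copy.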
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics accumulating from the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive;
predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there's plenty of further regions that
ideally should be collected because they contain almost nothing but
garbage (ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
would simply have exceeded the allotted time for this particular GC.
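The arithmetic in the log excerpt above can be reproduced with a small greedy loop: take candidate regions in order and stop at the first one whose predicted cost exceeds the remaining pause budget. This is only a sketch of the budget check visible in the log. The function name `pick_regions` is made up for illustration, and the real G1 policy applies further criteria (such as the GC efficiency check discussed in this thread) that are not modelled here.

```python
def pick_regions(predicted_ms, remaining_ms):
    """Greedily take regions in the given order while each still fits the budget.

    predicted_ms: predicted collection cost per candidate region, in the order
                  the policy would consider them (cheapest / most profitable first).
    remaining_ms: pause-time budget left before any region is picked.
    Returns (list of picked costs, budget left afterwards).
    """
    picked = []
    for cost in predicted_ms:
        if cost > remaining_ms:
            break  # corresponds to "next region too expensive" in the log
        picked.append(cost)
        remaining_ms -= cost
    return picked, remaining_ms

# Values taken from the last three "picked region" lines and the final
# "too expensive" line of the excerpt.
candidates = [0.345890, 0.345963, 0.346036, 0.346036]
picked, left = pick_regions(candidates, 1.317244)
print(len(picked), "regions picked,", round(left, 6), "ms left")
```

With the numbers from the excerpt, three regions fit and about 0.279 ms is left, which is less than the 0.346 ms predicted for the next region, matching the final log line.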
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to be > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
--
/ Peter Schuller
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that they are exactly or almost exactly the same,
as my proposed hypothesis that the non-young "other" time is stuck
would predict? If the young other time is not stuck
I guess one might see some variation (I seem to get < 1 ms on my
machine) but not a lot at all in comparison to 40ms. If you're seeing
variation like 40-42 all the time, and it never decreases
significantly after it has reached the 40ms range, that would be
consistent with the hypothesis I believe.
--
/ Peter Schuller
> There shouldn't be any swapping during the tests - I've got RAM fairly
> carefully allocated and I believe swappiness was tuned down on those
> machines, though I will double check to be certain.
Does HBase mmap() significant amounts of memory for I/O purposes? I'm
not very familiar with HBase and a quick Googling didn't yield an
answer.
With extensive mmap():ed I/O, excessive swapping of the application
seems to be a common problem even with significant memory margins,
sometimes even with swappiness turned down to 0. I've seen it happen
under several circumstances, and based on reports on the
cassandra-user mailing list during the past couple of months it seems
I'm not alone.
To be sure I recommend checking actual swapping history (or at least
check that the absolute amount of memory swapped out is reasonable
over time).
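On Linux, one way to check swapping history is to sample the cumulative `pswpin` and `pswpout` counters in `/proc/vmstat` at two points in time and compare them. The following is a minimal sketch assuming a Linux host; the helper names are illustrative, and tools such as `vmstat` or `sar` report the same information.

```python
import time

def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def swap_delta(before, after):
    """Pages swapped in and out between two parsed samples."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in ("pswpin", "pswpout")}

def read_vmstat(path="/proc/vmstat"):
    """Read and parse the kernel's cumulative VM counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())

def sample_swap_activity(interval_s=10.0):
    """Return pages swapped in/out during interval_s seconds."""
    before = read_vmstat()
    time.sleep(interval_s)
    return swap_delta(before, read_vmstat())
```

Calling `sample_swap_activity()` periodically while the test runs and logging the result alongside the GC log would show whether any swap activity lines up with the slow collections. Non-zero deltas around the time of a slow pause would support the swapping hypothesis; consistently zero deltas would argue against it.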
> I'll try to read through your full email in detail while looking at the
> source and the G1 paper -- right now it's a bit above my head :)
Well, just to re-iterate though I have really only begun looking at it
myself and my ramblings may be completely off the mark.
> FWIW, my tests on JRockit JRRT's gcprio:deterministic collector didn't go
> much better - eventually it fell back to a full compaction which lasted 45
> seconds or so. HBase must be doing something that's really hard for GCs to
> deal with - either on the heuristics front or on the allocation pattern
> front.
Interesting. I don't know a lot about JRockit's implementation since
not a lot of information seems to be available. I did my LRU
micro-benchmark with a ~20-30 GB heap and JRockit. I could definitely
press it hard enough to cause a fallback, but that seemed to be
directly as a result of high allocation rates simply exceeding the
forward progress made by the GC (based on blackbox observation
anyway).
(The other problem was that the compaction pauses were never able to
complete; it seems compaction is O(n) with respect to the number of
objects being compacted, and I was unable to make it compact less than
1% per GC (because the command line option only accepted integral
percents), and with my object count the 1% was enough to hit the pause
time requirement so compaction was aborted every time. Likely this
would have poor results over time as fragmentation becomes
significant.)
Does HBase go into periodic modes of very high allocation rate, or is
it fairly constant over time? I'm thinking that perhaps the concurrent
marking is just not triggered early enough and if large bursts of
allocations happen when the heap is relatively full, that might be the
triggering factor?
--
/ Peter Schuller
> Yep, I've seen JRRT also "abort compaction" on most compactions. I couldn't
> quite figure out how to tell it that it was fine to pause more often for
> compaction, so long as each pause was short.
FWIW, I got the impression at the time (but I don't remember why; I
think I was half-guessing based on assumptions about what it does and
several iterations through the documentation) that it was
fundamentally only *able* to do compaction during the stop-the-world
pause after a concurrent mark phase. I.e., I don't think you can make
it spread the work out (but I can most definitely be wrong).
--
/ Peter Schuller
A bit more data. I did the following patch:
@@ -1560,6 +1575,19 @@
       _non_young_other_cost_per_region_ms_seq->add(non_young_other_time_ms /
                                          (double) _recorded_non_young_regions);
+    } else {
+      // no non-young gen collections - if our prediction is high enough, we would
+      // never collect non-young again, so push it back towards zero so we give it
+      // another try.
+      double predicted_other_time = predict_non_young_other_time_ms(1);
+      if (predicted_other_time > MaxGCPauseMillis/2.0) {
+        if (G1PolicyVerbose > 0) {
+          gclog_or_tty->print_cr(
+            "Predicted non-young other time %.1f is too large compared to max pause time. Weighting down.",
+            predicted_other_time);
+        }
+        _non_young_other_cost_per_region_ms_seq->add(0.0);
+      }
     }
and this mostly solved the problem described above. Now I get a full GC
every 45-50 minutes which is way improved from what it was before.
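The feedback loop this patch targets can be illustrated with a toy model. The sketch below assumes the per-region "other" cost is predicted by a simple exponentially decaying average and that non-young regions are only collected (and a new sample only recorded) while the prediction is within the pause goal. The smoothing factor, the 0.3 ms "normal" cost and the function name `simulate` are assumptions for illustration; the actual G1 predictor is more involved. The 198.9 ms outlier and the 20 ms pause goal are the figures reported in this thread.

```python
def simulate(pause_goal_ms, weight_down, alpha=0.3, normal_ms=0.3,
             outlier_ms=198.9, collections=12):
    """Return the predicted per-region 'other' cost after each collection.

    weight_down=False models the unpatched policy: when no non-young region
    is collected, no sample is recorded and the prediction never changes.
    weight_down=True models the patch: a 0.0 sample is added while the
    prediction is above half the pause goal.
    """
    prediction = normal_ms
    history = []
    for i in range(collections):
        if i == 0:
            sample = outlier_ms        # one unusually slow choose_cset
        elif prediction <= pause_goal_ms:
            sample = normal_ms         # non-young regions collected: real sample
        elif weight_down and prediction > pause_goal_ms / 2.0:
            sample = 0.0               # patched behaviour: weight the average down
        else:
            sample = None              # nothing recorded, prediction stays frozen
        if sample is not None:
            prediction = alpha * sample + (1 - alpha) * prediction
        history.append(prediction)
    return history

stuck = simulate(20.0, weight_down=False)
fixed = simulate(20.0, weight_down=True)
print("unpatched:", [round(p, 1) for p in stuck])
print("patched:  ", [round(p, 1) for p in fixed])
```

In this model the unpatched prediction stays above the pause goal forever after a single outlier, so non-young regions are never selected again, while the patched version decays back below the goal after a few collections and then resumes recording real samples.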
I still seem to be putting off GC of non-young regions too much though. I
did some analysis of the G1 log and made these graphs:
http://people.apache.org/~todd/hbase-fragmentation/g1-graphing.png
The top graph is a heat map of the number of young (pink color) and
non-young (blue) regions in each collection.
The middle graph is the post-collection heap usage over time in MB
The bottom graph is a heat map and smoothed line graph of the number of
millis spent per collection. The target in this case is 50ms.
A few interesting things:
- not sure what causes the sort of periodic striated pattern in the number
of young generation regions chosen
- most of the time no old gen regions are selected for collection at all!
Here's a graph of just old regions:
http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
- When old regions are actually selected for collection the heap usage does
drop, though elapsed time does spike over the guarantee.
So it seems like something about the heuristics isn't quite right. Thoughts?
-Todd
On Fri, Jan 21, 2011 at 11:38 AM, Todd Lipcon <> wrote:
> Hey folks,
>
> Took some time over the last day or two to follow up on this on the latest
> checkout of JDK7. I added some more instrumentation and my findings so far
> are:
>
> 1) CMS is definitely hitting a fragmentation problem. Our workload is
> pretty much guaranteed to fragment, and I don't think there's anything CMS
> can do about it - see the following graphs:
> http://people.apache.org/~todd/hbase-fragmentation/
>
> 2) G1GC is hitting full pauses because the "other" pause time predictions
> end up higher than the pause time goal. I'm seeing the following sequence:
>
> - A single choose_cset operation for a non_young region takes a long time
> (unclear yet why this is happening, see below)
> - This inflates the predict_non_young_other_time_ms(1) result to a value
> greater than my pause goal
> - From then on, it doesn't collect any more non-young regions (ever!)
> because any region will be considered expensive regardless of the estimated
> rset or collection costs
> - The heap slowly fills up with non-young regions until we reach a full GC
>
> 3) So the question is why the choose_cset is taking a long time. I added
> getrusage() calls to wrap the choose_cset operation. Here's some output with
> extra logging:
>
> --> About to choose cset at 725.458
> Adding 1 young regions to the CSet
> Added 0x0000000000000001 Young Regions to CS.
> (3596288 KB left in heap.)
> (picked region; 9.948053ms predicted; 21.164738ms remaining; 2448kb
> marked; 2448kb maxlive; 59-59% liveness)
> (3593839 KB left in heap.)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (picked region; 10.493828ms predicted; 11.216685ms remaining; 2279kb
> marked; 2279kb maxlive; 55-55% liveness)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (3591560 KB left in heap.)
> predict_region_elapsed_time_ms: 10.346346ms total, 5.119780ms rs scan
> (13558 cnum), 5.046912 copy time (2439568 bytes), 0.179654 other time
> predict_region_elapsed_time_ms: 10.407672ms total, 5.333135ms rs scan
> (14123 cnum), 4.894882 copy time (2366080 bytes), 0.179654 other time
> (no more marked regions; next region too expensive (adaptive; predicted
> 10.407672ms > remaining 0.722857ms))
> Resource usage of choose_cset:majflt: 0 nswap: 0 nvcsw: 6 nivcsw: 0
> --> About to prepare RS scan at 725.657
>
> The resource usage line with nvcsw=6 indicates there were 6 voluntary
> context switches while choosing cset. This choose_cset operation took
> 198.9ms all in choosing non-young.
>
> So, why are there voluntary context switches while choosing cset? This
> isn't swapping -- that should show under majflt, right? My only theories
> are:
> - are any locks acquired in choose_cset?
> - maybe the gc logging itself is blocking on IO to the log file? ie the
> instrumentation itself is interfering with the algorithm?
>
>
> Regardless, I think a single lengthy choose_non_young_cset operation
> shouldn't be allowed to push the prediction above the time boundary and
> trigger this issue. Perhaps a simple workaround is that, whenever a
> collection chooses no non_young regions, it should contribute a value of 0
> to the average?
>
> I'll give this heuristic a try on my build and see if it solves the issue.
>
> -Todd
>
> On Tue, Jul 27, 2010 at 3:08 PM, Todd Lipcon <> wrote:
>
>> Hi all,
>>
>> Back from my vacation and took some time yesterday and today to build a
>> fresh JDK 7 with some additional debug printouts from Peter's patch.
>>
>> What I found was a bit different - the rset scanning estimates are low,
>> but I consistently am seeing "Other time" estimates in the >40ms range.
>> Given my pause time goal of 20ms, these estimates are I think excluding most
>> of the regions from collectability. I haven't been able to dig around yet to
>> figure out where the long estimate for "other time" is coming from - in the
>> collections logged it sometimes shows fairly high "Other" but the "Choose
>> CSet" component is very short. I'll try to add some more debug info to the
>> verbose logging and rerun some tests over the next couple of days.
>>
>> At the moment I'm giving the JRockit VM a try to see how its deterministic
>> GC stacks up against G1 and CMS.
>>
>> Thanks
>> -Todd
>>
>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
--
Todd Lipcon
Software Engineer, Cloudera
> - most of the time no old gen regions are selected for collection at all!
> Here's a graph of just old regions:
> http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
This is consistent with my anecdotal observations as well and I
believe it is expected. What I have observed happening is that
non-young (partial) collections always happen after the marking phases
some number of times, followed by young collections only until another
marking phase is triggered and completed.
I think this makes sense because region selection is based on cost
heuristics largely based on liveness data from marking. So you have
your marking phase followed by a period of decreasing availability of
non-young regions that are eligible for collection given the GC
efficiency goals (and the pause time goals), until there are 0 such.
Young collections then continue until unrelated criteria trigger a new
marking phase, giving non-young regions a chance again to get above
the eligibility watermark.
--
/ Peter Schuller
|
# 18

23-01-2011 08:42 AM
|
|
|
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config), and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some reasonable amount of
memory efficiency here. The more we can get out of our RAM for this
block cache, the better our systems perform. Also, a lot of the
settings are self-tuning, so if we raise Xmx the size of the block
cache is scaled up as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) Do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) Does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand it, you get to more quickly with G1 than you did
with CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I assume you don't have an easily shared test case that would let us
reproduce both the CMS fragmentation and the G1 full GC issues
locally, for quickest progress on this?
-- ramki
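Todd's reasoning about fragmentation rests on how much old-gen space was free when the promotion failed. The following is a minimal Python sketch and not part of the original thread: the function name old_gen_free_kb and the regular expression are illustrative assumptions, and the input is only the fragment of the CMS log line quoted in his message.

```python
import re

# Fragment of the CMS log line quoted in Todd's message (mail wrapping undone).
LINE = (
    "(concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220 secs] "
    "6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)], "
    "17.4546890 secs]"
)

# ParNew capacity reported on the same log line: 59008K->59008K(59008K).
PARNEW_CAPACITY_KB = 59008


def old_gen_free_kb(line):
    """Free old-gen space (KB) at the moment of the concurrent mode failure."""
    m = re.search(r"\(concurrent mode failure\): (\d+)K->(\d+)K\((\d+)K\)", line)
    if m is None:
        raise ValueError("no concurrent mode failure record found")
    used_before_kb, _used_after_kb, capacity_kb = (int(g) for g in m.groups())
    return capacity_kb - used_before_kb


if __name__ == "__main__":
    free_kb = old_gen_free_kb(LINE)
    print("old gen free: %d KB (%.2f GiB)" % (free_kb, free_kb / (1024.0 * 1024.0)))
    print("ratio to ParNew capacity: %.1f" % (free_kb / PARNEW_CAPACITY_KB))
```

The old gen had 1963896K free (about 1.87 GiB), roughly 33 times the entire ParNew capacity, when the promotion failed. That is consistent with the fragmentation reading, although the log line alone does not prove it.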
___________________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for
slop, i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of your leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
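The sizing arithmetic discussed in this message can be written down explicitly. The following is a minimal Python sketch and not part of the original thread: the function names are illustrative assumptions, the factor of 2 encodes the "allow as much again for slop" suggestion, and the 64 MB young gen is taken from the CMS flags quoted earlier in the thread.

```python
def suggested_heap_gb(stable_old_gb, young_gb, slop_factor=2.0):
    """Old gen sized at slop_factor times its stable occupancy, plus young gen.

    Returns (old_gen_gb, total_heap_gb).
    """
    old_gen_gb = stable_old_gb * slop_factor
    return old_gen_gb, old_gen_gb + young_gb


def expected_live_gb(heap_gb, memstore_fraction, cache_fraction):
    """Live set implied by HBase's heap-fraction settings, as Todd describes."""
    return heap_gb * (memstore_fraction + cache_fraction)


if __name__ == "__main__":
    # 4.2 GB stable old gen, 64 MB young gen (assumed from -XX:NewSize=64m).
    old_gb, heap_gb = suggested_heap_gb(4.2, 64 / 1024.0)
    print("old gen %.1f GB, total heap %.2f GB" % (old_gb, heap_gb))

    # Todd's figures for an 8 GB heap.
    print("40%% + 20%%: %.1f GB live" % expected_live_gb(8, 0.40, 0.20))
    print("20%% + 20%%: %.1f GB live" % expected_live_gb(8, 0.20, 0.20))
```

With a 4.2 GB stable old gen this gives an 8.4 GB old gen and a heap of about 8.46 GB. Because the HBase cache sizes are expressed as fractions of the heap, raising -Xmx would also raise the live set unless those fractions are lowered at the same time, which is the concern Ryan raised earlier in the thread.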
___________________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able to completely avoid the creeping fragmentation, if possible.
> There are other ways to tune for this, but they are more
> labour-intensive and tricky, and I would not want to go into that
> lightly. You might want to contact your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GCs recently in
a slightly different test that I also posted about to hotspot-gc-use,
and that turned out to be a result of the estimated rset scanning costs
being so high that regions were never selected for eviction even though
they had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the requested pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect is
intended to help do rset scanning in the background such that regions
like these could be evicted more cheaply during the STW eviction
pause. I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
--
/ Peter Schuller
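The two cases described in this message can be illustrated with a toy model of collection set selection under a pause budget. The following is a minimal Python sketch and not part of the original thread. It is not HotSpot's actual policy: the greedy ranking by reclaimed bytes per predicted millisecond, every cost constant, the region size, and all names (Region, choose_cset, demo_regions) are assumptions made purely for illustration.

```python
from dataclasses import dataclass

# All constants are illustrative assumptions, not HotSpot values.
REGION_BYTES = 1024 * 1024   # assumed region size
MS_PER_CARD = 0.00073        # assumed rset scan cost per card
MS_PER_LIVE_BYTE = 0.000002  # assumed copy cost per live byte
FIXED_MS = 0.3               # assumed fixed per-region overhead


@dataclass
class Region:
    name: str
    live_bytes: int
    rset_cards: int

    def predicted_ms(self):
        """Predicted evacuation cost: fixed + rset scanning + copying."""
        return (FIXED_MS
                + self.rset_cards * MS_PER_CARD
                + self.live_bytes * MS_PER_LIVE_BYTE)

    def reclaimable_bytes(self):
        return REGION_BYTES - self.live_bytes


def choose_cset(regions, pause_budget_ms):
    """Greedily pick regions by reclaimed bytes per ms while the budget lasts."""
    ranked = sorted(
        regions,
        key=lambda r: r.reclaimable_bytes() / r.predicted_ms(),
        reverse=True,
    )
    chosen = []
    remaining_ms = pause_budget_ms
    for region in ranked:
        cost_ms = region.predicted_ms()
        if cost_ms <= remaining_ms:
            chosen.append(region.name)
            remaining_ms -= cost_ms
    return chosen


def demo_regions():
    """Ten cheap half-live regions plus two nearly empty ones with big rsets."""
    cheap = [Region("cheap%d" % i, 512 * 1024, 1000) for i in range(10)]
    warm = Region("warm", 24 * 1024, 20000)  # would fit a 20 ms budget alone
    hot = Region("hot", 24 * 1024, 60000)    # exceeds a 20 ms budget by itself
    return cheap + [warm, hot]


if __name__ == "__main__":
    regions = demo_regions()
    print("20 ms budget:", choose_cset(regions, 20.0))
    print("100 ms budget:", choose_cset(regions, 100.0))
```

In this model, with a 20 ms budget the "hot" region can never be chosen (case 1), and the "warm" region is always crowded out by cheaper regions even though it would fit on its own (case 2). Raising the budget to 100 ms lets both be collected, which mirrors the mitigation from increasing the pause time goal.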
___________________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you described) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
___________________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue.... thanks,
Tony, HS GC Group
___________________________________________________
Ramki/Tony,
> Any chance of setting +PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see the footnotes for the URLs referenced in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
(61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
(21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time
So in the most extreme case in the excerpt above, that's > half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive, almost entirely due to rset scanning costs.
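The per-region lines quoted above can be pulled out of the log mechanically. A minimal sketch in Python, assuming the field layout shown in the excerpt; the function and key names are made up for illustration, and the log format comes from a throw-away debugging patch rather than stock HotSpot output:

```python
import re

# Matches one "predict_region_elapsed_time_ms" record after whitespace
# has been normalised (the mail client wrapped records across lines).
PREDICT_RE = re.compile(
    r"predict_region_elapsed_time_ms:\s*(?P<total>[\d.]+)ms total,\s*"
    r"(?P<rs_scan>[\d.]+)ms rs scan\s*\((?P<cards>\d+) cnum\),\s*"
    r"(?P<copy>[\d.]+) copy time\s*\((?P<bytes>\d+) bytes\),\s*"
    r"(?P<other>[\d.]+) other time"
)


def parse_predictions(log_text):
    """Return one dict per predicted region found in log_text."""
    flat = " ".join(log_text.split())  # undo line wrapping
    rows = []
    for match in PREDICT_RE.finditer(flat):
        rows.append({
            "total_ms": float(match.group("total")),
            "rs_scan_ms": float(match.group("rs_scan")),
            "cards": int(match.group("cards")),
            "copy_ms": float(match.group("copy")),
            "copy_bytes": int(match.group("bytes")),
            "other_ms": float(match.group("other")),
        })
    return rows


def most_expensive(rows, count=5):
    """Regions with the highest predicted rset scan time first."""
    return sorted(rows, key=lambda r: r["rs_scan_ms"], reverse=True)[:count]
```

On the six records in the excerpt, rs_scan_ms divided by cards comes out at roughly 0.00073 ms per card for every region, so in this run the predicted scan cost tracks the card count almost exactly.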
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics accumulated from the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive;
predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there are plenty of further regions that
ideally should be collected because they contain almost no live data
(ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
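The stopping rule visible in that log excerpt can be mimicked by a simple greedy loop. This is a sketch under the assumption that candidates arrive already sorted by the collector's preference and that each one is charged its predicted time against the remaining pause budget; `pick_regions` is a hypothetical name, not the HotSpot routine:

```python
def pick_regions(predicted_ms, budget_ms):
    """Greedily pick regions until the next one no longer fits the budget.

    predicted_ms: predicted pause cost per candidate, in preference order.
    budget_ms:    pause time still available when selection starts.
    Returns (indices of picked regions, budget left over).
    """
    picked = []
    remaining = budget_ms
    for index, cost in enumerate(predicted_ms):
        if cost > remaining:
            # "next region too expensive": stop, even if later ones are cheap
            break
        picked.append(index)
        remaining -= cost
    return picked, remaining


# Predicted costs and starting budget taken from the log lines above.
candidates = [0.345890, 0.345963, 0.346036, 0.346036]
picked, remaining = pick_regions(candidates, 1.317244)
print(picked, round(remaining, 6))
```

With those inputs the first three candidates are picked and about 0.279355 ms is left, which matches the "remaining" figure in the final log line where the fourth region is rejected.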
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to be > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
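The feedback loop in this hypothesis is easy to reproduce in a toy model. A minimal sketch, assuming the "other" cost prediction is a plain average over a bounded history and that a region type is skipped whenever its prediction exceeds the pause goal; the real G1 predictor is more elaborate, and all names and numbers here are hypothetical:

```python
from collections import deque


def simulate(pause_goal_ms, normal_ms, spike_ms, spike_cycle, cycles, window=10):
    """Return, per cycle, whether non-young regions were collected.

    A cost sample is only recorded when a collection actually happens,
    so once the prediction exceeds the goal no new samples arrive.
    """
    history = deque(maxlen=window)
    collected = []
    for cycle in range(cycles):
        prediction = sum(history) / len(history) if history else 0.0
        if prediction > pause_goal_ms:
            collected.append(False)  # too expensive: skipped, nothing recorded
            continue
        sample = spike_ms if cycle == spike_cycle else normal_ms
        history.append(sample)
        collected.append(True)
    return collected


# One slow cycle (for example during a swap storm) at cycle 5.
result = simulate(pause_goal_ms=20.0, normal_ms=0.3, spike_ms=200.0,
                  spike_cycle=5, cycles=20)
print(result)
```

In this model a single inflated sample is enough: after cycle 5 the average stays above the goal forever, because the only thing that could lower it is a new sample, and no new sample is ever taken.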
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that they are exactly or almost exactly the same,
which would fit my proposed hypothesis that the
non-young "other" time is stuck? If the young other time is not stuck
I guess one might see some variation (I seem to get < 1 ms on my
machine) but not a lot at all in comparison to 40ms. If you're seeing
variation like 40-42 all the time, and it never decreases
significantly after reaching the 40ms range, that would be
consistent with the hypothesis I believe.
--
/ Peter Schuller
_______________________________________________
> There shouldn't be any swapping during the tests - I've got RAM fairly
> carefully allocated and I believe swappiness was tuned down on those
> machines, though I will double check to be certain.
Does HBase mmap() significant amounts of memory for I/O purposes? I'm
not very familiar with HBase and a quick Googling didn't yield an
answer.
With extensive mmap():ed I/O, excessive swapping of the application
seems to be a common problem even with significant memory margins,
sometimes even with swappiness turned down to 0. I've seen it happen
under several circumstances, and based on reports on the
cassandra-user mailing list during the past couple of months it seems
I'm not alone.
To be sure I recommend checking actual swapping history (or at least
check that the absolute amount of memory swapped out is reasonable
over time).
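One way to check swapping history on Linux is to sample the cumulative swap counters before and after a test run. A small sketch, assuming a Linux host where /proc/vmstat exposes the pswpin and pswpout page counters; the helper names are invented for this example:

```python
def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two snapshots."""
    return {
        name: after.get(name, 0) - before.get(name, 0)
        for name in ("pswpin", "pswpout")
    }


def read_vmstat(path="/proc/vmstat"):
    """Take a snapshot of the live counters (Linux only)."""
    with open(path) as handle:
        return parse_vmstat(handle.read())
```

Taking `read_vmstat()` at the start and end of a run and passing both snapshots to `swap_delta` shows whether any pages moved to or from swap in between; a non-zero result would be a reason to look closer.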
> I'll try to read through your full email in detail while looking at the
> source and the G1 paper -- right now it's a bit above my head :)
Well, just to re-iterate though I have really only begun looking at it
myself and my ramblings may be completely off the mark.
> FWIW, my tests on JRockit JRRT's gcprio:deterministic collector didn't go
> much better - eventually it fell back to a full compaction which lasted 45
> seconds or so. HBase must be doing something that's really hard for GCs to
> deal with - either on the heuristics front or on the allocation pattern
> front.
Interesting. I don't know a lot about JRockit's implementation since
not a lot of information seems to be available. I did my LRU
micro-benchmark with a ~20-30 GB heap and JRockit. I could definitely
press it hard enough to cause a fallback, but that seemed to be
directly as a result of high allocation rates simply exceeding the
forward progress made by the GC (based on blackbox observation
anyway).
(The other problem was that the compaction pauses were never able to
complete; it seems compaction is O(n) with respect to the number of
objects being compacted, and I was unable to make it compact less than
1% per GC (because the command line option only accepted integral
percents), and with my object count the 1% was enough to hit the pause
time requirement so compaction was aborted every time. Likely this
would have poor results over time as fragmentation becomes
significant.)
Does HBase go into periodic modes of very high allocation rate, or is
it fairly constant over time? I'm thinking that perhaps the concurrent
marking is just not triggered early enough and if large bursts of
allocations happen when the heap is relatively full, that might be the
triggering factor?
--
/ Peter Schuller
_______________________________________________
> Yep, I've seen JRRT also "abort compaction" on most compactions. I couldn't
> quite figure out how to tell it that it was fine to pause more often for
> compaction, so long as each pause was short.
FWIW, I got the impression at the time (but I don't remember why; I
think I was half-guessing based on assumptions about what it does and
several iterations through the documentation) that it was
fundamentally only *able* to do compaction during the stop-the-world
pause after a concurrent mark phase. I.e., I don't think you can make
it spread the work out (but I can most definitely be wrong).
--
/ Peter Schuller
_______________________________________________
A bit more data. I did the following patch:
@@ -1560,6 +1575,19 @@
         _non_young_other_cost_per_region_ms_seq->add(non_young_other_time_ms /
                                          (double) _recorded_non_young_regions);
+      } else {
+        // no non-young gen collections - if our prediction is high enough, we would
+        // never collect non-young again, so push it back towards zero so we give it
+        // another try.
+        double predicted_other_time = predict_non_young_other_time_ms(1);
+        if (predicted_other_time > MaxGCPauseMillis/2.0) {
+          if (G1PolicyVerbose > 0) {
+            gclog_or_tty->print_cr(
+              "Predicted non-young other time %.1f is too large compared to max pause time. Weighting down.",
+              predicted_other_time);
+          }
+          _non_young_other_cost_per_region_ms_seq->add(0.0);
+        }
       }
and this mostly solved the problem described above. Now I get a full GC
every 45-50 minutes which is way improved from what it was before.
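How quickly zero samples pull an inflated prediction back down depends on how the history is averaged. A rough sketch, assuming a plain exponentially weighted average with a made-up weight `alpha` (HotSpot's own sequence class is not modelled here), that counts how many 0.0 samples are needed before the prediction falls to half the pause goal, the point at which the patch above stops pushing zeros:

```python
def zero_samples_needed(prediction_ms, pause_goal_ms, alpha=0.3):
    """Count the 0.0 samples added until prediction <= pause_goal_ms / 2.

    Each sample updates the estimate as
        new = alpha * sample + (1 - alpha) * old
    where alpha is a hypothetical smoothing weight.
    """
    threshold = pause_goal_ms / 2.0
    count = 0
    while prediction_ms > threshold:
        prediction_ms = alpha * 0.0 + (1.0 - alpha) * prediction_ms
        count += 1
    return count


# Start from a prediction equal to the 198.9 ms choose_cset outlier.
print(zero_samples_needed(198.9, 50.0))
print(zero_samples_needed(198.9, 20.0))
```

Under these assumptions a 198.9 ms prediction needs 6 zero samples for a 50 ms goal and 9 for a 20 ms goal, so recovery takes a handful of collections rather than never happening at all.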
I still seem to be putting off GC of non-young regions too much though. I
did some analysis of the G1 log and made these graphs:
http://people.apache.org/~todd/hbase-fragmentation/g1-graphing.png
The top graph is a heat map of the number of young (pink) and
non-young (blue) regions in each collection.
The middle graph is the post-collection heap usage over time in MB
The bottom graph is a heat map and smoothed line graph of the number of
millis spent per collection. The target in this case is 50ms.
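For anyone wanting to build similar graphs, the post-collection heap usage can be extracted from the `before->after(capacity)` figures that the G1 log prints for each pause. A minimal sketch, assuming sizes are always logged in megabytes with an `M` suffix as in the excerpts in this thread; the function name is invented:

```python
import re

# e.g. "[ 7677M->7636M(8000M)]" or "[Full GC 7934M->4865M(8000M),"
HEAP_RE = re.compile(r"(\d+)M->(\d+)M\((\d+)M\)")


def heap_transitions(log_text):
    """Return a list of (before_mb, after_mb, capacity_mb) tuples."""
    return [
        (int(before), int(after), int(capacity))
        for before, after, capacity in HEAP_RE.findall(log_text)
    ]


sample = """
   [ 7677M->7636M(8000M)]
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
"""
for before, after, capacity in heap_transitions(sample):
    print(after, "MB live after collection,", before - after, "MB reclaimed")
```

Plotting the second element of each tuple against the log timestamps gives the kind of post-collection heap usage curve described for the middle graph.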
A few interesting things:
- not sure what causes the sort of periodic striated pattern in the number
of young generation regions chosen
- most of the time no old gen regions are selected for collection at all!
Here's a graph of just old regions:
http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
- When old regions are actually selected for collection the heap usage does
drop, though elapsed time does spike over the guarantee.
So it seems like something about the heuristics isn't quite right. Thoughts?
-Todd
On Fri, Jan 21, 2011 at 11:38 AM, Todd Lipcon <> wrote:
> Hey folks,
>
> Took some time over the last day or two to follow up on this on the latest
> checkout of JDK7. I added some more instrumentation and my findings so far
> are:
>
> 1) CMS is definitely hitting a fragmentation problem. Our workload is
> pretty much guaranteed to fragment, and I don't think there's anything CMS
> can do about it - see the following graphs:
> http://people.apache.org/~todd/hbase-fragmentation/
>
> 2) G1GC is hitting full pauses because the "other" pause time predictions
> end up higher than the minimum pause length. I'm seeing the following
> sequence:
>
> - A single choose_cset operation for a non_young region takes a long time
> (unclear yet why this is happening, see below)
> - This inflates the predict_non_young_other_time_ms(1) result to a value
> greater than my pause goal
> - From then on, it doesn't collect any more non-young regions (ever!)
> because any region will be considered expensive regardless of the estimated
> rset or collection costs
> - The heap slowly fills up with non-young regions until we reach a full GC
>
> 3) So the question is why the choose_cset is taking a long time. I added
> getrusage() calls to wrap the choose_cset operation. Here's some output with
> extra logging:
>
> --> About to choose cset at 725.458
> Adding 1 young regions to the CSet
> Added 0x0000000000000001 Young Regions to CS.
> (3596288 KB left in heap.)
> (picked region; 9.948053ms predicted; 21.164738ms remaining; 2448kb
> marked; 2448kb maxlive; 59-59% liveness)
> (3593839 KB left in heap.)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (picked region; 10.493828ms predicted; 11.216685ms remaining; 2279kb
> marked; 2279kb maxlive; 55-55% liveness)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (3591560 KB left in heap.)
> predict_region_elapsed_time_ms: 10.346346ms total, 5.119780ms rs scan
> (13558 cnum), 5.046912 copy time (2439568 bytes), 0.179654 other time
> predict_region_elapsed_time_ms: 10.407672ms total, 5.333135ms rs scan
> (14123 cnum), 4.894882 copy time (2366080 bytes), 0.179654 other time
> (no more marked regions; next region too expensive (adaptive; predicted
> 10.407672ms > remaining 0.722857ms))
> Resource usage of choose_cset:majflt: 0 nswap: 0 nvcsw: 6 nivcsw: 0
> --> About to prepare RS scan at 725.657
>
> The resource usage line with nvcsw=6 indicates there were 6 voluntary
> context switches while choosing cset. This choose_cset operation took
> 198.9ms all in choosing non-young.
>
> So, why are there voluntary context switches while choosing cset? This
> isn't swapping -- that should show under majflt, right? My only theories
> are:
> - are any locks acquired in choose_cset?
> - maybe the gc logging itself is blocking on IO to the log file? ie the
> instrumentation itself is interfering with the algorithm?
>
>
> Regardless, I think a single lengthy choose_non_young_cset operation
> shouldn't be allowed to push the prediction above the time boundary and
> trigger this issue. Perhaps a simple workaround is that, whenever a
> collection chooses no non_young regions, it should contribute a value of 0
> to the average?
>
> I'll give this heuristic a try on my build and see if it solves the issue.
>
> -Todd
>
> On Tue, Jul 27, 2010 at 3:08 PM, Todd Lipcon <> wrote:
>
>> Hi all,
>>
>> Back from my vacation and took some time yesterday and today to build a
>> fresh JDK 7 with some additional debug printouts from Peter's patch.
>>
>> What I found was a bit different - the rset scanning estimates are low,
>> but I consistently am seeing "Other time" estimates in the >40ms range.
>> Given my pause time goal of 20ms, these estimates are I think excluding most
>> of the regions from collectability. I haven't been able to dig around yet to
>> figure out where the long estimate for "other time" is coming from - in the
>> collections logged it sometimes shows fairly high "Other" but the "Choose
>> CSet" component is very short. I'll try to add some more debug info to the
>> verbose logging and rerun some tests over the next couple of days.
>>
>> At the moment I'm giving the JRockit VM a try to see how its deterministic
>> GC stacks up against G1 and CMS.
>>
>> Thanks
>> -Todd
--
Todd Lipcon
Software Engineer, Cloudera
> - most of the time no old gen regions are selected for collection at all!
> Here's a graph of just old regions:
> http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
This is consistent with my anecdotal observations as well and I
believe it is expected. What I have observed happening is that
non-young (partial) collections always happen after the marking phases
some number of times, followed by young collections only until another
marking phase is triggered and completed.
I think this makes sense because region selection is based on cost
heuristics largely based on liveness data from marking. So you have
your marking phase followed by a period of decreasing availability of
non-young regions that are eligible for collection given the GC
efficiency goals (and the pause time goals), until there are 0 such.
Young collections then continue until unrelated criteria trigger a new
marking phase, giving non-young regions a chance again to get above
the eligibility watermark.
--
/ Peter Schuller
_______________________________________________
> I still seem to be putting off GC of non-young regions too much though. I
Part of my experiments I have been harping on was the below change to
cut GC efficiency out of the decision to perform non-young
collections. I'm not suggesting it actually be disabled, but perhaps
it can be adjusted to fit your workload? If there is nothing outright
wrong in terms of predictions and the problem is due to cost estimates
being too high, that may be a way to avoid full GC:s at the expense of
more expensive GC activity. This smells like something that should be
a tweakable VM option. Just like GCTimeRatio affects heap expansion
decisions, something to affect this (probably just a ratio applied to
the test below?).
Another thing: This is to a large part my human confirmation biased
brain speaking, but I would be really interested to find out if the
slow build-up you seem to be experiencing is indeed caused by rs scan
costs due to sparse table overflow (I've been harping about roughly the
same thing several times so maybe people are tired of it; most
recently in the thread "g1: dealing with high rates of inter-region
pointer writes").
Is your test easily runnable so that one can reproduce? Preferably
without lots of hbase/hadoop knowledge. I.e., is it something that can
be run in a self-contained fashion fairly easily?
Here's the patch indicating where to adjust the efficiency thresholding:
--- a/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp Fri Dec 17 23:32:58 2010 -0800
+++ b/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp Sun Jan 23 09:21:54 2011 +0100
@@ -1463,7 +1463,7 @@
if ( !_last_young_gc_full ) {
if ( _should_revert_to_full_young_gcs ||
_known_garbage_ratio < 0.05 ||
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && //false && // scodetodo
(get_gc_eff_factor() * cur_efficiency < predict_young_gc_eff())) ) {
set_full_young_gcs(true);
}
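The condition touched by the patch above can be restated outside HotSpot to see how a tunable factor would shift the decision. A sketch only: the Python parameter names mirror the identifiers in the diff, while the numeric values in the usage lines are invented for illustration and do not come from any real run:

```python
def reverts_to_young_only(cur_efficiency, predicted_young_efficiency,
                          eff_factor, known_garbage_ratio,
                          adaptive_young_list=True, should_revert=False):
    """Mirror of the three-way test in the diff.

    Returns True when the policy would go back to young-only collections:
    an explicit revert request, too little known garbage (< 5%), or a
    scaled partial-collection efficiency below the predicted young one.
    """
    return (
        should_revert
        or known_garbage_ratio < 0.05
        or (adaptive_young_list
            and eff_factor * cur_efficiency < predicted_young_efficiency)
    )


# Hypothetical efficiencies: a larger factor keeps partial collections going.
print(reverts_to_young_only(10.0, 25.0, eff_factor=2.0, known_garbage_ratio=0.5))
print(reverts_to_young_only(10.0, 25.0, eff_factor=3.0, known_garbage_ratio=0.5))
```

With these made-up numbers the first call reverts to young-only collections and the second does not, which is the effect a ratio-style VM option applied to this test would have.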
--
/ Peter Schuller
_______________________________________________
|
# 19

24-01-2011 06:16 AM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full GCs almost
look like the heap got too full, which would mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
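One way to check whether incremental collection is falling behind is to pull the Full GC entries out of the log and look at how quickly the heap refills between them. Here is a minimal Python sketch, assuming each entry sits on a single line in the format Todd quoted (the lines wrapped by the mailer are re-joined in SAMPLE). The function names are hypothetical and this is not an official log parser.

```python
import re

# Matches e.g. "52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]"
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<heap>\d+)M\), (?P<secs>\d+\.\d+) secs\]"
)

SAMPLE = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
"""


def parse_full_gcs(log_text):
    """Return one dict per Full GC entry found in log_text."""
    events = []
    for m in FULL_GC.finditer(log_text):
        events.append({
            "uptime_s": float(m.group("uptime")),
            "before_mb": int(m.group("before")),
            "after_mb": int(m.group("after")),
            "heap_mb": int(m.group("heap")),
            "pause_s": float(m.group("secs")),
        })
    return events


def refill_between_full_gcs(events):
    """For each consecutive pair: (seconds between starts, MB regrown)."""
    rows = []
    for prev, cur in zip(events, events[1:]):
        interval_s = cur["uptime_s"] - prev["uptime_s"]
        regrown_mb = cur["before_mb"] - prev["after_mb"]
        rows.append((interval_s, regrown_mb))
    return rows


if __name__ == "__main__":
    for interval_s, regrown_mb in refill_between_full_gcs(parse_full_gcs(SAMPLE)):
        print(f"{interval_s:.1f} s between full GCs, heap regrew by {regrown_mb} MB")
```

Running the same script over a comparable CMS log would need a different pattern, since CMS prints its full collections in another format.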
-- ramki
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run into
> full heap compaction which takes several minutes and wreaks havoc on the
> system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a fair
> amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000
> secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5
> 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60% of
> the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out our
> main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16 GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here. The thing is, the more we can get out of our
RAM for this block cache, the better our systems perform. Also,
a lot of the settings are self-tuning, so if we up the Xmx, the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna
<> wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
>
> On 07/06/10 13:27, Todd Lipcon wrote:
>> Hi all,
>>
>> I work on HBase, a distributed database written in Java. We generally
>> run on large heaps (8GB+), and our object lifetime distribution has
>> proven pretty problematic for garbage collection (we manage a multi-GB
>> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
>> which later get collected).
>>
>> In Java6, we generally run with the CMS collector, tuning down
>> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
>> fairly low pause GC, but after a week or two of uptime we often run into
>> full heap compaction which takes several minutes and wreaks havoc on the
>> system.
>>
>> Needless to say, we've been watching the development of the G1 GC with
>> anticipation for the last year or two. Finally it seems in the latest
>> build of JDK7 it's stable enough for actual use (previously it would
>> segfault within an hour or so). However, in my testing I'm seeing a fair
>> amount of 8-10 second Full GC pauses.
>>
>> The flags I'm passing are:
>>
>> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
>>
>> Most of the pauses I see in the GC log are around 10-20ms as expected:
>>
>> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000
>> secs]
>> [Parallel Time: 10.5 ms]
>> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
>> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
>> 1680080.1 1680080.0 1680079.9 1680081.5]
>> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5
>> 1.7 1.7 0.1
>> Avg: 1.8, Min: 0.1, Max: 2.5]
>> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>> Sum: 52, Avg: 4, Min: 1, Max: 8]
>> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
>> 0.4 0.5 0.3 0.3 0.0
>> Avg: 0.4, Min: 0.0, Max: 0.7]
>> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
>> 0.8 0.8 0.9
>> Avg: 0.5, Min: 0.0, Max: 0.9]
>> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
>> 6.9 7.1 7.1 7.0
>> Avg: 7.1, Min: 6.9, Max: 7.3]
>> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Other: 0.6 ms]
>> [Clear CT: 0.5 ms]
>> [Other: 1.1 ms]
>> [Choose CSet: 0.0 ms]
>> [ 7677M->7636M(8000M)]
>> [Times: user=0.12 sys=0.00, real=0.01 secs]
>>
>> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
>> 9.8907800 secs]
>> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
>> 9.9025520 secs]
>> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
>> 10.1232190 secs]
>> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
>> 10.4997760 secs]
>> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
>> 11.0497380 secs]
>>
>> These pauses are pretty unacceptable for soft real time operation.
>>
>> Am I missing some tuning that should be done for G1GC for applications
>> like this? Is 20ms out of 80ms too aggressive a target for the garbage
>> rates we're generating?
>>
>> My actual live heap usage should be very stable around 5GB - the
>> application very carefully accounts its live object set at around 60% of
>> the max heap (as you can see in the logs above).
>>
>> At this point we are considering doing crazy things like ripping out our
>> main memory consumers into a custom slab allocator, and manually
>> reference count the byte array slices. But, if we can get G1GC to work
>> for us, it will save a lot of engineering on the application side!
>>
>> Thanks
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>
>> ------------------------------------------------------------------------
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in your option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If I am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default, and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
Maybe then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full GC (which,
as I understand it, you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu <
> > wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of you application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
>
>
>
> On 07/06/10 13:27, Todd Lipcon wrote:
>> Hi all,
>>
>> I work on HBase, a distributed database written in Java. We
>> generally run on large heaps (8GB+), and our object lifetime
>> distribution has proven pretty problematic for garbage collection
>> (we manage a multi-GB LRU cache inside the process, so in CMS we
>> tenure a lot of byte arrays which later get collected).
>>
>> In Java6, we generally run with the CMS collector, tuning down
>> CMSInitiatingOccupancyFraction and constraining MaxNewSize to
>> achieve fairly low pause GC, but after a week or two of uptime we
>> often run into full heap compaction which takes several minutes
>> and wreaks havoc on the system.
>>
>> Needless to say, we've been watching the development of the G1 GC
>> with anticipation for the last year or two. Finally it seems in
>> the latest build of JDK7 it's stable enough for actual use
>> (previously it would segfault within an hour or so). However, in
>> my testing I'm seeing a fair amount of 8-10 second Full GC pauses.
>>
>> The flags I'm passing are:
>>
>> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
>> -XX:GCPauseIntervalMillis=80
>>
>> Most of the pauses I see in the GC log are around 10-20ms as expected:
>>
>> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
>> 0.01209000 secs]
>> [Parallel Time: 10.5 ms]
>> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
>> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
>> 1680080.1 1680080.0 1680079.9 1680081.5]
>> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2
>> 1.9 2.5 1.7 1.7 0.1
>> Avg: 1.8, Min: 0.1, Max: 2.5]
>> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>> Sum: 52, Avg: 4, Min: 1, Max: 8]
>> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4
>> 0.5 0.4 0.5 0.3 0.3 0.0
>> Avg: 0.4, Min: 0.0, Max: 0.7]
>> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6
>> 0.0 0.8 0.8 0.9
>> Avg: 0.5, Min: 0.0, Max: 0.9]
>> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2
>> 7.1 6.9 7.1 7.1 7.0
>> Avg: 7.1, Min: 6.9, Max: 7.3]
>> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>> 0.0 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0]
>> [Other: 0.6 ms]
>> [Clear CT: 0.5 ms]
>> [Other: 1.1 ms]
>> [Choose CSet: 0.0 ms]
>> [ 7677M->7636M(8000M)]
>> [Times: user=0.12 sys=0.00, real=0.01 secs]
>>
>> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC
>> 7934M->4865M(8000M), 9.8907800 secs]
>> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC
>> 7930M->4964M(8000M), 9.9025520 secs]
>> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC
>> 7934M->4882M(8000M), 10.1232190 secs]
>> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC
>> 7938M->5002M(8000M), 10.4997760 secs]
>> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC
>> 7938M->4962M(8000M), 11.0497380 secs]
>>
>> These pauses are pretty unacceptable for soft real time operation.
>>
>> Am I missing some tuning that should be done for G1GC for
>> applications like this? Is 20ms out of 80ms too aggressive a
>> target for the garbage rates we're generating?
>>
>> My actual live heap usage should be very stable around 5GB - the
>> application very carefully accounts its live object set at around
>> 60% of the max heap (as you can see in the logs above).
>>
>> At this point we are considering doing crazy things like ripping
>> out our main memory consumers into a custom slab allocator, and
>> manually reference count the byte array slices. But, if we can get
>> G1GC to work for us, it will save a lot of engineering on the
>> application side!
>>
>> Thanks
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>> ------------------------------------------------------------------------
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes, the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not include the
portion leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
I assume you don't have an easily shared test case with which we
could reproduce both the CMS fragmentation and the G1 full GC
issues locally, for quickest progress on this?
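The arithmetic behind the fragmentation reading of that log entry can be spelled out. Below is a small Python sketch that extracts the ParNew and CMS figures from the entry Todd quoted and compares the free space in the old gen with the size of the whole young gen. The function name is hypothetical, and the numbers alone do not prove fragmentation; they only show that total free space was not the limiting factor when the promotion failed.

```python
import re

# The log entry from the mail, re-joined into a single string.
ENTRY = (
    "2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew "
    "(promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221: "
    "[CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean: "
    "0.556/0.947 secs] [Times: user=5.76 sys=0.26, real=0.95 secs] "
    "(concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220 "
    "secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)], "
    "17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]"
)

SIZES = r"(\d+)K->(\d+)K\((\d+)K\)"
PARNEW = re.compile(r"ParNew \(promotion failed\): " + SIZES)
CMS_FAILURE = re.compile(r"\(concurrent mode failure\): " + SIZES)


def promotion_failure_summary(entry):
    """Compare old gen free space with young gen capacity at the failure."""
    young = PARNEW.search(entry)
    old = CMS_FAILURE.search(entry)
    if young is None or old is None:
        raise ValueError("not a promotion-failed / concurrent-mode-failure entry")
    young_capacity_kb = int(young.group(3))
    old_used_before_kb = int(old.group(1))
    old_capacity_kb = int(old.group(3))
    return {
        "young_capacity_kb": young_capacity_kb,
        "old_free_before_kb": old_capacity_kb - old_used_before_kb,
    }


if __name__ == "__main__":
    s = promotion_failure_summary(ENTRY)
    ratio = s["old_free_before_kb"] / s["young_capacity_kb"]
    print(f"old gen free: {s['old_free_before_kb']} KB, "
          f"young gen: {s['young_capacity_kb']} KB, ratio {ratio:.1f}x")
```

The old gen had roughly 1.9 GB free against a young gen of under 60 MB, so a failed promotion points at the lack of a large enough contiguous free chunk rather than at a full heap.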
-- ramki
_______________________________________________
___________________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, I see. In that case, I think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much again for slop,
i.e. make the old gen 8.4 GB (or twice whatever the measured stable
old gen occupancy is), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as I said earlier, perhaps
double the old gen from its present size. If that doesn't work,
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
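The sizing rule above is simple arithmetic and can be written down as a sketch. The Python below takes the measured stable old gen occupancy, applies a slop factor of 2, adds the young gen, and prints matching heap flags. The function name and the slop_factor parameter are made up for the illustration; the 4300 MB figure is an approximation of the 4.2 GB mentioned above, and the 64 MB young gen is taken from the CMS flags quoted earlier in the thread.

```python
def suggest_heap_mb(stable_old_gen_mb, young_gen_mb, slop_factor=2.0):
    """Return (old_gen_mb, total_heap_mb) for a given stable occupancy.

    slop_factor=2.0 means "allow as much again for slop", i.e. size the
    old gen at twice its stable occupancy.
    """
    old_gen_mb = int(stable_old_gen_mb * slop_factor)
    return old_gen_mb, old_gen_mb + young_gen_mb


def heap_flags(total_heap_mb, young_gen_mb):
    """Format the result as standard HotSpot heap sizing flags."""
    return (f"-Xmx{total_heap_mb}m "
            f"-XX:NewSize={young_gen_mb}m -XX:MaxNewSize={young_gen_mb}m")


if __name__ == "__main__":
    old_mb, heap_mb = suggest_heap_mb(stable_old_gen_mb=4300, young_gen_mb=64)
    print(f"old gen {old_mb} MB, whole heap {heap_mb} MB")
    print(heap_flags(heap_mb, 64))
```

Once a configuration sized this way is known to run without full collections, slop_factor can be lowered step by step to find an acceptable space overhead.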
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of yr leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
___________________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GCs recently in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect is
intended to do rset scanning in the background such that regions
like these could be evicted more cheaply during the STW eviction
pause; I didn't find such a thread anywhere in the source code,
but I may very well just be missing it.
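The two cases can be illustrated with a toy model. The Python sketch below is not HotSpot's actual collection set policy: it assumes a simple greedy selection that ranks regions by garbage reclaimed per millisecond of estimated cost and takes whatever still fits in the pause budget. The Region fields, the function name and all the numbers are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    garbage_mb: float
    copy_cost_ms: float
    rset_scan_cost_ms: float

    @property
    def cost_ms(self):
        return self.copy_cost_ms + self.rset_scan_cost_ms

    @property
    def efficiency(self):
        # Pay-off/cost ratio: MB reclaimed per ms of estimated pause.
        return self.garbage_mb / self.cost_ms


def choose_collection_set(regions, pause_budget_ms):
    """Greedy toy selection: best pay-off/cost first, while it fits."""
    chosen = []
    remaining_ms = pause_budget_ms
    for region in sorted(regions, key=lambda r: r.efficiency, reverse=True):
        if region.cost_ms <= remaining_ms:
            chosen.append(region)
            remaining_ms -= region.cost_ms
    return chosen


A = Region("A", garbage_mb=0.9, copy_cost_ms=1.0, rset_scan_cost_ms=1.0)
B = Region("B", garbage_mb=0.8, copy_cost_ms=1.0, rset_scan_cost_ms=3.0)
# Case (1): rset scanning alone exceeds any reasonable pause goal.
C = Region("C", garbage_mb=0.95, copy_cost_ms=0.5, rset_scan_cost_ms=30.0)
# Case (2): would fit on its own, but is always out-competed.
D = Region("D", garbage_mb=0.9, copy_cost_ms=0.5, rset_scan_cost_ms=9.5)

if __name__ == "__main__":
    for budget_ms in (10.0, 20.0):
        names = [r.name for r in choose_collection_set([A, B, C, D], budget_ms)]
        print(f"budget {budget_ms} ms -> {names}")
```

In this model region C is never collected at either budget although it is almost entirely garbage, and region D is only picked up once the pause goal is raised, which matches the observation that a larger pause goal mitigates the effect without fixing it in general.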
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
On 07/12/10 09:02, Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
>
> I have never run HBase, but in an LRU stress test (I posted about it a
> few months ago) I specifically observed remembered set scanning costs
> go way up. In addition I was seeing fallbacks to full GC:s recently in
> a slightly different test that I also posed about to -use, and that
> turned out to be a result of the estimated rset scanning costs being
> so high that regions were never selected for eviction even though they
> had very little live data. I would be very interested to hear if
> you're having the same problem. My last post on the topic is here:
>
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
>
> Including the link to the (throw-away) patch that should tell you
> whether this is what's happening:
>
> http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>
> Out of personal curiosity I'd be very interested to hear whether this
> is what's happening to you (in a real reasonable use-case rather than
> a synthetic benchmark).
>
> My sense (and hotspot/g1 developers please smack me around if I am
> misrepresenting anything here) is that the effect I saw (with rset
> scanning costs) could cause perpetual memory grow (until fallback to
> full GC) in two ways:
>
> (1) The estimated (and possibly real) cost of rset scanning for a
> single region could be so high that it is never possible to select it
> for eviction given the asked for pause time goals. Hence, such a
> region effectively "leaks" until full GC.
>
> (2) The estimated (and possibly real) cost of rset scanning for
> regions may be so high that there are, in practice, always other
> regions selected for high pay-off/cost ratios, such that they end up
> never being collected even if theoretically a single region could be
> evicted within the pause time goal.
>
> These are effectively the same thing, with (1) being an extreme case of (2).
>
> In both cases, the effect should be mitigated (and has been in the
> case where I did my testing), but as far as I can tell not generally
> "fixed", by increasing the pause time goals.
>
> It is unclear to me how this is intended to be handled. The original
> g1 paper mentions an rset scanning thread that I may suspect would be
> intended to help do rset scanning in the background such that regions
> like these could be evicted more cheaply during the STW eviction
> pause; but I didn't find such a thread anywhere in the source code -
> but I may very well just be missing it.
>
_______________________________________________
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue.... thanks,
Tony, HS GC Group
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for URL references in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time
So in the most extreme case in the excerpt above, that's > half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive and almost only due to rset scanning costs.
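The records quoted above are enough to recover the shape of the cost model that the debugging patch prints. The following is a minimal Python sketch, not part of HotSpot or of the patch: the regular expression and the helper name `parse_records` are made up for illustration, and the log text is the excerpt above with each record unwrapped onto one line. It divides the predicted rs scan time by the card count for each record.

```python
import re

# The six records quoted above, one record per line.
LOG = """\
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time
"""

# Hypothetical pattern matching the output format of the throw-away patch.
RECORD = re.compile(
    r"predict_region_elapsed_time_ms: (?P<total>[\d.]+)ms total, "
    r"(?P<rs>[\d.]+)ms rs scan \((?P<cards>\d+) cnum\), "
    r"(?P<copy>[\d.]+) copy time \((?P<bytes>\d+) bytes\), "
    r"(?P<other>[\d.]+) other time"
)


def parse_records(log):
    """Return one dict per predict_region_elapsed_time_ms record in the log."""
    records = []
    for m in RECORD.finditer(log):
        records.append({
            "total": float(m["total"]),
            "rs": float(m["rs"]),
            "cards": int(m["cards"]),
            "copy": float(m["copy"]),
            "bytes": int(m["bytes"]),
            "other": float(m["other"]),
        })
    return records


records = parse_records(LOG)
for r in records:
    per_card_us = r["rs"] / r["cards"] * 1000.0  # ms per card -> microseconds per card
    rs_share = r["rs"] / r["total"]              # fraction of the total that is rs scan
    print(f"{r['cards']:>7} cards: {per_card_us:.4f} us/card, rs scan is {rs_share:.1%} of total")
```

Every record works out to roughly 0.73 microseconds per card, so the prediction appears to be linear in the card count, and rs scanning makes up well over 95% of each region's predicted total.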
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics accumulating from the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive; predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there are plenty of further regions that
ideally should be collected because they contain almost nothing but
garbage (ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
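The selection behaviour visible in the log lines above can be approximated as a greedy walk over candidate regions with a shrinking time budget. This is only a sketch in Python, not G1's actual code: the function name `choose_regions` and the data layout are hypothetical, and the candidates are assumed to arrive already sorted with the most attractive region first. The predictions and the starting budget are copied from the log.

```python
def choose_regions(predicted_ms, budget_ms):
    """Greedily pick candidates, in the given order, while they fit the budget.

    predicted_ms: predicted collection cost per candidate region, assumed to be
                  pre-sorted so that the most attractive region comes first.
    budget_ms:    pause time still available for non-young regions.
    Returns (picked_costs, remaining_budget_ms).
    """
    picked = []
    remaining = budget_ms
    for cost in predicted_ms:
        if cost > remaining:
            # Mirrors "next region too expensive": selection ends here and
            # every later candidate is left uncollected for this pause.
            break
        picked.append(cost)
        remaining -= cost
    return picked, remaining


# Predictions and starting budget taken from the log lines above; the fourth
# value is the prediction reported in the "too expensive" line.
candidates = [0.345890, 0.345963, 0.346036, 0.346036]
picked, remaining = choose_regions(candidates, budget_ms=1.317244)
print(f"picked {len(picked)} regions, {remaining:.6f} ms left")
```

With these figures the loop picks three regions and stops with about 0.279 ms of budget left, which matches the last line of the excerpt. A region whose prediction alone exceeds the budget is never picked, however little live data it holds.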
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
git checkout 20100714_1 # grab from appropriate tag, in case I change master
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a Clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
Thoughts, anyone?
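The hypothesis in the bullets above can be played through in a toy model. The Python sketch below assumes the prediction is an exponentially decaying average of past samples; the real predictor in HotSpot may weight its history differently, and the function name `simulate`, the weight `alpha` and all numbers are made up for illustration.

```python
def simulate(samples_ms, pause_goal_ms, pauses, normal_cost_ms=0.3, alpha=0.7):
    """Simulate the hypothesised feedback loop for the non-young 'other' cost.

    samples_ms:     measured per-region 'other' costs seen so far (oldest first).
    pause_goal_ms:  pause time goal; a prediction above it blocks selection.
    pauses:         number of further GC pauses to simulate.
    normal_cost_ms: cost measured whenever non-young regions are collected.
    alpha:          weight kept by the old average on each update (assumed).
    Returns the prediction after each simulated pause.
    """
    # Build the decaying average from the existing history.
    avg = samples_ms[0]
    for s in samples_ms[1:]:
        avg = alpha * avg + (1.0 - alpha) * s

    history = []
    for _ in range(pauses):
        if avg <= pause_goal_ms:
            # Non-young regions were eligible, so a fresh sample is recorded.
            avg = alpha * avg + (1.0 - alpha) * normal_cost_ms
        # Otherwise no non-young region is collected and nothing is recorded,
        # so the prediction cannot change.
        history.append(avg)
    return history


# An outlier that still leaves the prediction under the goal decays away...
print(simulate([0.3, 0.3, 50.0], pause_goal_ms=20.0, pauses=5))
# ...but one that pushes the prediction over the goal is never corrected.
print(simulate([0.3, 0.3, 150.0], pause_goal_ms=20.0, pauses=5))
```

In the first run the prediction falls with every pause; in the second it stays at the same value for every pause, which is the "stuck" behaviour the hypothesis describes.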
--
/ Peter Schuller
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that they are exactly or almost exactly the same,
which would fit my proposed hypothesis that the non-young "other" time
is stuck? If the young other time is not stuck I guess one might see
some variation (I seem to get < 1 ms on my machine) but not a lot at
all in comparison to 40ms. If you're seeing variation like 40-42 all
the time, and it never decreases significantly after it reached the
40ms range, that would be consistent with the hypothesis I believe.
--
/ Peter Schuller
> There shouldn't be any swapping during the tests - I've got RAM fairly
> carefully allocated and I believe swappiness was tuned down on those
> machines, though I will double check to be certain.
Does HBase mmap() significant amounts of memory for I/O purposes? I'm
not very familiar with HBase and a quick Googling didn't yield an
answer.
With extensive mmap():ed I/O, excessive swapping of the application
seems to be a common problem even with significant memory margins,
sometimes even with swappiness turned down to 0. I've seen it happen
under several circumstances, and based on reports on the
cassandra-user mailing list during the past couple of months it seems
I'm not alone.
To be sure I recommend checking actual swapping history (or at least
check that the absolute amount of memory swapped out is reasonable
over time).
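One way to check swapping history on Linux is to compare the cumulative `pswpin` and `pswpout` counters from `/proc/vmstat` at two points in time. The sketch below is in Python and is Linux-specific; the helper names are hypothetical and the two snapshots in the example are made-up values, not measurements from any system discussed here.

```python
def parse_swap_counters(vmstat_text):
    """Extract cumulative swap-in/swap-out page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two snapshots."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


def read_swap_counters(path="/proc/vmstat"):
    """Read the counters from a live Linux system."""
    with open(path) as f:
        return parse_swap_counters(f.read())


# Two made-up snapshots, e.g. taken a minute apart while the test runs.
before = parse_swap_counters("pgfault 1000\npswpin 120\npswpout 340\n")
after = parse_swap_counters("pgfault 5000\npswpin 120\npswpout 912\n")
print(swap_delta(before, after))
```

On a real machine, `read_swap_counters()` would be called twice around the period of interest; a delta that stays at zero suggests the system was not swapping during that window.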
> I'll try to read through your full email in detail while looking at the
> source and the G1 paper -- right now it's a bit above my head :)
Well, just to re-iterate though I have really only begun looking at it
myself and my ramblings may be completely off the mark.
> FWIW, my tests on JRockit JRRT's gcprio:deterministic collector didn't go
> much better - eventually it fell back to a full compaction which lasted 45
> seconds or so. HBase must be doing something that's really hard for GCs to
> deal with - either on the heuristics front or on the allocation pattern
> front.
Interesting. I don't know a lot about JRockit's implementation since
not a lot of information seems to be available. I did my LRU
micro-benchmark with a ~20-30 GB heap and JRockit. I could definitely
press it hard enough to cause a fallback, but that seemed to be
directly as a result of high allocation rates simply exceeding the
forward progress made by the GC (based on blackbox observation
anyway).
(The other problem was that the compaction pauses were never able to
complete; it seems compaction is O(n) with respect to the number of
objects being compacted, and I was unable to make it compact less than
1% per GC (because the command line option only accepted integral
percents), and with my object count the 1% was enough to hit the pause
time requirement so compaction was aborted every time. Likely this
would have poor results over time as fragmentation becomes
significant.)
Does HBase go into periodic modes of very high allocation rate, or is
it fairly constant over time? I'm thinking that perhaps the concurrent
marking is just not triggered early enough and if large bursts of
allocations happen when the heap is relatively full, that might be the
triggering factor?
--
/ Peter Schuller
> Yep, I've seen JRRT also "abort compaction" on most compactions. I couldn't
> quite figure out how to tell it that it was fine to pause more often for
> compaction, so long as each pause was short.
FWIW, I got the impression at the time (but I don't remember why; I
think I was half-guessing based on assumptions about what it does and
several iterations through the documentation) that it was
fundamentally only *able* to do compaction during the stop-the-world
pause after a concurrent mark phase. I.e., I don't think you can make
it spread the work out (but I can most definitely be wrong).
--
/ Peter Schuller
A bit more data. I did the following patch:
@@ -1560,6 +1575,19 @@
       _non_young_other_cost_per_region_ms_seq->add(non_young_other_time_ms /
                                          (double) _recorded_non_young_regions);
+    } else {
+      // no non-young gen collections - if our prediction is high enough, we would
+      // never collect non-young again, so push it back towards zero so we give it
+      // another try.
+      double predicted_other_time = predict_non_young_other_time_ms(1);
+      if (predicted_other_time > MaxGCPauseMillis/2.0) {
+        if (G1PolicyVerbose > 0) {
+          gclog_or_tty->print_cr(
+            "Predicted non-young other time %.1f is too large compared to max pause time. Weighting down.",
+            predicted_other_time);
+        }
+        _non_young_other_cost_per_region_ms_seq->add(0.0);
+      }
     }
and this mostly solved the problem described above. Now I get a full GC
every 45-50 minutes which is way improved from what it was before.
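To get a feel for how quickly the patch above can pull an inflated prediction back down, here is a rough Python sketch. It assumes the prediction behaves like an exponentially decaying average, which may not match the sequence class HotSpot actually uses; the function name and the weight `alpha` are hypothetical. The starting value of 198.9 ms borrows the duration of the slow choose_cset operation reported in this thread purely as an illustrative prediction, and 50 ms is the pause target mentioned for the graphs in this message.

```python
def pauses_until_eligible(prediction_ms, max_gc_pause_ms, alpha=0.7):
    """Count zero samples needed before the prediction stops being 'too large'.

    Mirrors the patch's rule: while the prediction exceeds half the pause goal,
    feed a 0.0 sample into the average. The decaying-average update and the
    value of alpha are assumptions of this sketch.
    """
    threshold = max_gc_pause_ms / 2.0
    count = 0
    while prediction_ms > threshold:
        prediction_ms = alpha * prediction_ms + (1.0 - alpha) * 0.0
        count += 1
    return count, prediction_ms


count, final = pauses_until_eligible(prediction_ms=198.9, max_gc_pause_ms=50.0)
print(f"{count} zero samples, prediction now {final:.2f} ms")
```

Under these assumptions six collections without non-young regions are enough to bring the prediction under the 25 ms threshold, after which non-young regions can be considered again.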
I still seem to be putting off GC of non-young regions too much though. I
did some analysis of the G1 log and made these graphs:
http://people.apache.org/~todd/hbase-fragmentation/g1-graphing.png
The top graph is a heat map of the number of young (pink color) and
non-young (blue) regions in each collection.
The middle graph is the post-collection heap usage over time in MB
The bottom graph is a heat map and smoothed line graph of the number of
millis spent per collection. The target in this case is 50ms.
A few interesting things:
- not sure what causes the sort of periodic striated pattern in the number
of young generation regions chosen
- most of the time no old gen regions are selected for collection at all!
Here's a graph of just old regions:
http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
- When old regions are actually selected for collection the heap usage does
drop, though elapsed time does spike over the guarantee.
So seems like something about the heuristics isn't quite right. Thoughts?
-Todd
On Fri, Jan 21, 2011 at 11:38 AM, Todd Lipcon <> wrote:
> Hey folks,
>
> Took some time over the last day or two to follow up on this on the latest
> checkout of JDK7. I added some more instrumentation and my findings so far
> are:
>
> 1) CMS is definitely hitting a fragmentation problem. Our workload is
> pretty much guaranteed to fragment, and I don't think there's anything CMS
> can do about it - see the following graphs:
> http://people.apache.org/~todd/hbase-fragmentation/
>
> 2) G1GC is hitting full pauses because the "other" pause time predictions
> end up higher than the minimum pause length. I'm seeing the following
> sequence:
>
> - A single choose_cset operation for a non_young region takes a long time
> (unclear yet why this is happening, see below)
> - This inflates the predict_non_young_other_time_ms(1) result to a value
> greater than my pause goal
> - From then on, it doesn't collect any more non-young regions (ever!)
> because any region will be considered expensive regardless of the estimated
> rset or collection costs
> - The heap slowly fills up with non-young regions until we reach a full GC
>
> 3) So the question is why the choose_cset is taking a long time. I added
> getrusage() calls to wrap the choose_cset operation. Here's some output with
> extra logging:
>
> --> About to choose cset at 725.458
> Adding 1 young regions to the CSet
> Added 0x0000000000000001 Young Regions to CS.
> (3596288 KB left in heap.)
> (picked region; 9.948053ms predicted; 21.164738ms remaining; 2448kb
> marked; 2448kb maxlive; 59-59% liveness)
> (3593839 KB left in heap.)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (picked region; 10.493828ms predicted; 11.216685ms remaining; 2279kb
> marked; 2279kb maxlive; 55-55% liveness)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (3591560 KB left in heap.)
> predict_region_elapsed_time_ms: 10.346346ms total, 5.119780ms rs scan
> (13558 cnum), 5.046912 copy time (2439568 bytes), 0.179654 other time
> predict_region_elapsed_time_ms: 10.407672ms total, 5.333135ms rs scan
> (14123 cnum), 4.894882 copy time (2366080 bytes), 0.179654 other time
> (no more marked regions; next region too expensive (adaptive; predicted
> 10.407672ms > remaining 0.722857ms))
> Resource usage of choose_cset:majflt: 0 nswap: 0 nvcsw: 6 nivcsw: 0
> --> About to prepare RS scan at 725.657
>
> The resource usage line with nvcsw=6 indicates there were 6 voluntary
> context switches while choosing cset. This choose_cset operation took
> 198.9ms all in choosing non-young.
>
> So, why are there voluntary context switches while choosing cset? This
> isn't swapping -- that should show under majflt, right? My only theories
> are:
> - are any locks acquired in choose_cset?
> - maybe the gc logging itself is blocking on IO to the log file? ie the
> instrumentation itself is interfering with the algorithm?
>
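The getrusage() wrapping described in the quoted message can be illustrated outside the JVM. The sketch below uses Python's standard `resource` module (Unix only) and is not the instrumentation that was added to HotSpot; the helper name `rusage_delta` is hypothetical. It reports the same four counters that appear in the "Resource usage of choose_cset" line.

```python
import resource
import time

# The counters printed in the instrumented log line.
FIELDS = ("ru_majflt", "ru_nswap", "ru_nvcsw", "ru_nivcsw")


def rusage_delta(operation):
    """Run operation() and return (result, per-field resource usage deltas)."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    result = operation()
    after = resource.getrusage(resource.RUSAGE_SELF)
    delta = {name: getattr(after, name) - getattr(before, name) for name in FIELDS}
    return result, delta


# Sleeping blocks the thread, which the kernel normally counts as a voluntary
# context switch (ru_nvcsw), without needing any major page fault (ru_majflt).
_, delta = rusage_delta(lambda: time.sleep(0.01))
print(delta)
```

A voluntary context switch is recorded whenever a thread gives up the CPU because it blocks, for example on a lock, on I/O or in a sleep, which is why both theories listed in the quoted message are consistent with a non-zero nvcsw and a zero majflt.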
>
> Regardless, I think a single lengthy choose_non_young_cset operation
> shouldn't be allowed to push the prediction above the time boundary and
> trigger this issue. Perhaps a simple workaround is that, whenever a
> collection chooses no non_young regions, it should contribute a value of 0
> to the average?
>
> I'll give this heuristic a try on my build and see if it solves the issue.
>
> -Todd
>
> On Tue, Jul 27, 2010 at 3:08 PM, Todd Lipcon <> wrote:
>
>> Hi all,
>>
>> Back from my vacation and took some time yesterday and today to build a
>> fresh JDK 7 with some additional debug printouts from Peter's patch.
>>
>> What I found was a bit different - the rset scanning estimates are low,
>> but I consistently am seeing "Other time" estimates in the >40ms range.
>> Given my pause time goal of 20ms, these estimates are I think excluding most
>> of the regions from collectability. I haven't been able to dig around yet to
>> figure out where the long estimate for "other time" is coming from - in the
>> collections logged it sometimes shows fairly high "Other" but the "Choose
>> CSet" component is very short. I'll try to add some more debug info to the
>> verbose logging and rerun some tests over the next couple of days.
>>
>> At the moment I'm giving the JRockit VM a try to see how its deterministic
>> GC stacks up against G1 and CMS.
>>
>> Thanks
>> -Todd
>>
>>
>> On Tue, Jul 13, 2010 at 5:15 PM, Peter Schuller <
>> > wrote:
>>
>>> [...]
>>>
--
Todd Lipcon
Software Engineer, Cloudera
> - most of the time no old gen regions are selected for collection at all!
> Here's a graph of just old regions:
> http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
This is consistent with my anecdotal observations as well and I
believe it is expected. What I have observed happening is that
non-young (partial) collections always happen after the marking phases
some number of times, followed by young collections only until another
marking phase is triggered and completed.
I think this makes sense because region selection is based on cost
heuristics largely based on liveness data from marking. So you have
your marking phase followed by a period of decreasing availability of
non-young regions that are eligible for collection given the GC
efficiency goals (and the pause time goals), until there are 0 such.
Young collections then continue until unrelated criteria trigger a new
marking phase, giving non-young regions a chance again to get above
the eligibility watermark.
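
The cycle described above (marking, a few partials, then young-only until the next marking) can be illustrated with a toy model. This is only a sketch and not the HotSpot code: the Region class, the fixed efficiency floor, the pause budget and the sample numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    garbage_kb: float    # reclaimable space according to the last marking
    predicted_ms: float  # predicted cost of evacuating the region

    @property
    def efficiency(self) -> float:
        # KB reclaimed per ms of predicted pause time
        return self.garbage_kb / self.predicted_ms


def pick_partial(regions, min_efficiency, pause_budget_ms):
    """Pick non-young regions, most efficient first, within the pause budget."""
    chosen, spent_ms = [], 0.0
    for region in sorted(regions, key=lambda r: r.efficiency, reverse=True):
        if region.efficiency < min_efficiency:
            break  # everything after this one is even less efficient
        if spent_ms + region.predicted_ms <= pause_budget_ms:
            chosen.append(region)
            spent_ms += region.predicted_ms
    return chosen


def partials_after_marking(regions, min_efficiency, pause_budget_ms):
    """Run partial collections until no region is eligible any more."""
    remaining, history = list(regions), []
    while True:
        chosen = pick_partial(remaining, min_efficiency, pause_budget_ms)
        if not chosen:
            return history, remaining  # young-only until the next marking
        history.append([r.name for r in chosen])
        remaining = [r for r in remaining if r not in chosen]


# Hypothetical regions: D is almost all garbage but very expensive to scan.
regions = [
    Region("A", 4000, 2),
    Region("B", 3000, 3),
    Region("C", 2000, 4),
    Region("D", 4095, 635),
]
history, leftover = partials_after_marking(regions, min_efficiency=100,
                                           pause_budget_ms=5)
print(history, [r.name for r in leftover])
```

In this toy run two partials happen and then nothing is eligible; region D stays behind however much garbage it holds, which is the pattern being discussed.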
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
> I still seem to be putting off GC of non-young regions too much though. I
Part of my experiments I have been harping on was the below change to
cut GC efficiency out of the decision to perform non-young
collections. I'm not suggesting it actually be disabled, but perhaps
it can be adjusted to fit your workload? If there is nothing outright
wrong in terms of predictions and the problem is due to cost estimates
being too high, that may be a way to avoid full GC:s at the expense of
more expensive GC activity. This smells like something that should be
a tweakable VM option. Just like GCTimeRatio affects heap expansion
decisions, something to affect this (probably just a ratio applied to
the test below?).
Another thing: This is to a large part my human confirmation biased
brain speaking, but I would be really interested to find out if the
slow build-up you seem to be experiencing is indeed caused by rs scan
costs due to sparse table overflow (I've been harping about roughly the
same thing several times so maybe people are tired of it; most
recently in the thread "g1: dealing with high rates of inter-region
pointer writes").
Is your test easily runnable so that one can reproduce? Preferably
without lots of hbase/hadoop knowledge. I.e., is it something that can
be run in a self-contained fashion fairly easily?
Here's the patch indicating where to adjust the efficiency thresholding:
--- a/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp Fri Dec 17 23:32:58 2010 -0800
+++ b/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp Sun Jan 23 09:21:54 2011 +0100
@@ -1463,7 +1463,7 @@
if ( !_last_young_gc_full ) {
if ( _should_revert_to_full_young_gcs ||
_known_garbage_ratio < 0.05 ||
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && //false && // scodetodo
(get_gc_eff_factor() * cur_efficiency < predict_young_gc_eff())) ) {
set_full_young_gcs(true);
}
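
To make the shape of that test easier to play with, here is a rough model of the condition in the hunk above. The function and parameter names are mine; in particular "tolerance" is a hypothetical knob standing in for the kind of tweakable ratio suggested above, and does not exist in the VM.

```python
def should_revert_to_young_only(known_garbage_ratio, cur_efficiency,
                                predicted_young_efficiency, *,
                                eff_factor=1.0, tolerance=1.0,
                                adaptive_young_list=True, forced=False):
    """Model of the check that switches G1 back to young-only collections.

    tolerance is hypothetical: values above 1.0 make partial collections
    look more efficient than measured, so they keep happening for longer.
    """
    too_inefficient = (tolerance * eff_factor * cur_efficiency
                       < predicted_young_efficiency)
    return (forced
            or known_garbage_ratio < 0.05
            or (adaptive_young_list and too_inefficient))


# Made-up numbers: partials reclaim 50 K/ms, young collections 200 K/ms.
print(should_revert_to_young_only(0.30, 50, 200))
print(should_revert_to_young_only(0.30, 50, 200, tolerance=5.0))
```

With the default tolerance the model reverts to young-only collections; with tolerance=5.0 it keeps doing partials at the price of more expensive pauses. The 5% known-garbage cutoff still applies either way.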
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Unfortunately my test is not easy to reproduce in its current form. But as I
look more and more into it, it looks like we're running into the same issue.
I added some code at the end of the mark phase that, after it sorts the
regions by efficiency, will print an object histogram for any regions that
are >98% garbage but very inefficient (<100KB/ms predicted collection rate).
Here's an example of an "uncollectable" region that is all garbage but for
one object:
Region 0x00002aaab0203e18 ( M1) [0x00002aaaf3800000, 0x00002aaaf3c00000]
Used: 4096K, garbage: 4095K. Eff: 6.448103 K/ms
Very low-occupancy low-efficiency region. Histogram:
num #instances #bytes class name
----------------------------------------------
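
As a back-of-the-envelope check on the figures printed for this region, and assuming the efficiency figure is simply reclaimable space divided by predicted collection time (my reading of the output, not something confirmed in this thread), the implied predicted cost can be worked out like this:

```python
# Figures copied from the region printout above.
used_kb = 4096
garbage_kb = 4095
efficiency_kb_per_ms = 6.448103

# Assumption: efficiency = garbage / predicted collection time.
garbage_fraction = garbage_kb / used_kb
predicted_ms = garbage_kb / efficiency_kb_per_ms

# The two filters described above: >98% garbage and <100 KB/ms.
flagged = garbage_fraction > 0.98 and efficiency_kb_per_ms < 100

print(f"garbage fraction:   {garbage_fraction:.4f}")
print(f"implied cost in ms: {predicted_ms:.1f}")
print(f"flagged:            {flagged}")
```

Under that assumption the predicted cost of evacuating this single 4 MB region comes out at roughly 635 ms, far beyond a 20 ms pause target, so it is never picked even though only 1K of it is live.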
___________________________________________________
|
# 20

24-01-2011 08:24 AM
|
|
|
Todd,
Could you send a segment of the GC logs from the beginning
through the first dozen or so full GC's?
Exactly which version of the JVM are you using?
java -version
will tell us.
Do you have a test setup where you could do some experiments?
Can you send the set of CMS flags you use? It might tell
us something about the GC behavior of your application.
Might not tell us anything but it's worth a look.
Jon
On 07/06/10 13:27, Todd Lipcon wrote:
> Hi all,
>
> I work on HBase, a distributed database written in Java. We generally
> run on large heaps (8GB+), and our object lifetime distribution has
> proven pretty problematic for garbage collection (we manage a multi-GB
> LRU cache inside the process, so in CMS we tenure a lot of byte arrays
> which later get collected).
>
> In Java6, we generally run with the CMS collector, tuning down
> CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve
> fairly low pause GC, but after a week or two of uptime we often run
> into full heap compaction which takes several minutes and wreaks havoc
> on the system.
>
> Needless to say, we've been watching the development of the G1 GC with
> anticipation for the last year or two. Finally it seems in the latest
> build of JDK7 it's stable enough for actual use (previously it would
> segfault within an hour or so). However, in my testing I'm seeing a
> fair amount of 8-10 second Full GC pauses.
>
> The flags I'm passing are:
>
> -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:GCPauseIntervalMillis=80
>
> Most of the pauses I see in the GC log are around 10-20ms as expected:
>
> 2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
> 0.01209000 secs]
> [Parallel Time: 10.5 ms]
> [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2
> 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0
> 1680080.1 1680080.0 1680079.9 1680081.5]
> [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9
> 2.5 1.7 1.7 0.1
> Avg: 1.8, Min: 0.1, Max: 2.5]
> [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
> Sum: 52, Avg: 4, Min: 1, Max: 8]
> [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5
> 0.4 0.5 0.3 0.3 0.0
> Avg: 0.4, Min: 0.0, Max: 0.7]
> [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0
> 0.8 0.8 0.9
> Avg: 0.5, Min: 0.0, Max: 0.9]
> [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1
> 6.9 7.1 7.1 7.0
> Avg: 7.1, Min: 6.9, Max: 7.3]
> [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> 0.0 0.0 0.0 0.0
> Avg: 0.0, Min: 0.0, Max: 0.0]
> [Other: 0.6 ms]
> [Clear CT: 0.5 ms]
> [Other: 1.1 ms]
> [Choose CSet: 0.0 ms]
> [ 7677M->7636M(8000M)]
> [Times: user=0.12 sys=0.00, real=0.01 secs]
>
> But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
> [todd@monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
> 2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M),
> 9.8907800 secs]
> 2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M),
> 9.9025520 secs]
> 2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M),
> 10.1232190 secs]
> 2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M),
> 10.4997760 secs]
> 2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M),
> 11.0497380 secs]
>
> These pauses are pretty unacceptable for soft real time operation.
>
> Am I missing some tuning that should be done for G1GC for applications
> like this? Is 20ms out of 80ms too aggressive a target for the garbage
> rates we're generating?
>
> My actual live heap usage should be very stable around 5GB - the
> application very carefully accounts its live object set at around 60%
> of the max heap (as you can see in the logs above).
>
> At this point we are considering doing crazy things like ripping out
> our main memory consumers into a custom slab allocator, and manually
> reference count the byte array slices. But, if we can get G1GC to work
> for us, it will save a lot of engineering on the application side!
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
> ------------------------------------------------------------------------
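
For anyone wanting to pull the same numbers out of their own log, here is a small sketch that summarizes the Full GC lines quoted above. The wrapped log lines have been re-joined; the regular expression only targets this particular line format and is not a general GC log parser.

```python
import re

# Matches e.g. "52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]"
FULL_GC = re.compile(
    r"(?P<uptime>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<capacity>\d+)M\),\s+(?P<secs>\d+\.\d+) secs\]"
)

LOG = """\
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
"""


def summarize_full_gcs(log_text):
    """Return one dict per Full GC line found in log_text."""
    events = []
    for m in FULL_GC.finditer(log_text):
        before, after = int(m["before"]), int(m["after"])
        capacity = int(m["capacity"])
        events.append({
            "uptime_s": float(m["uptime"]),
            "reclaimed_mb": before - after,
            "live_fraction": after / capacity,  # occupancy after the pause
            "pause_s": float(m["secs"]),
        })
    return events


events = summarize_full_gcs(LOG)
# Seconds between consecutive full collections.
intervals = [b["uptime_s"] - a["uptime_s"] for a, b in zip(events, events[1:])]

for e in events:
    print(e)
print([round(i, 1) for i in intervals])
```

On the five lines above, occupancy after each full collection lands between 60% and 63% of the 8000M heap, which agrees with the "around 60%" live set mentioned in the message.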
___________________________________________________
Did you try doubling the heap size? You might want to post a full
log so we can see what's happening between those full collections.
Also, if you have comparable CMS logs,
all the better, as a known starting point. The full gc's almost
look like the heap got too full, so it must mean that incremental
collection is not keeping up with the rate of garbage generation.
Also, what's the JDK build you are running?
-- ramki
___________________________________________________
I also work with Todd on these systems (I am one of the other people
with the alternate CMS config) and doubling the heap size from 8 GB to
16GB is a little insane... we'd like to have some amount of reasonable
memory efficiency here... The thing is the more we can get out of our
ram for this block cache, the better performing our systems are. Also
a lot of the settings are self tuning, so if we up the Xmx the size of
the block cache is scaled as well.
-ryan
On Tue, Jul 6, 2010 at 2:12 PM, Y. S. Ramakrishna wrote:
> Did you try doubling the heap size? You might want to post a full
> log so we can see what's happening between those full collections.
> Also, If you have comparable CMS logs
> all the better, as a known starting point. The full gc's almost
> look like the heap got too full, so it must mean that incremental
> collection is not keeping up with the rate of garbage generation.
> Also, what's the JDK build you are running?
>
> -- ramki
___________________________________________________
Two questions:
(1) do you enable class unloading with CMS? (I do not see that
below in yr option list, but wondered.)
(2) does your application load classes, or intern a lot of strings?
If i am reading the logs right, G1 appears to reclaim less
and less of the heap in each cycle until a full collection
intervenes, and I have no real explanation for this behaviour
except that perhaps there's something in the perm gen that
keeps stuff in the remainder of the heap artificially live.
G1 does not incrementally collect the perm gen, so this is
plausible. But CMS does not either by default and I do not see
that option in the CMS options list you gave below. It would
be instructive to see what the comparable CMS logs look like.
May be then you could start with the same heap shapes for the
two and see if you can get to the bottom of the full gc (which
as i understand you get to more quickly w/G1 than you did
w/CMS).
-- ramki
On 07/06/10 14:24, Todd Lipcon wrote:
> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu wrote:
>
> Todd,
>
> Could you send a segment of the GC logs from the beginning
> through the first dozen or so full GC's?
>
>
> Sure, I just put it online at:
>
> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>
>
>
> Exactly which version of the JVM are you using?
>
> java -version
>
> will tell us.
>
>
> Latest as of last night:
>
> [todd@monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>
>
> Do you have a test setup where you could do some experiments?
>
>
> Sure, I have a five node cluster here where I do lots of testing, happy
> to try different builds/options/etc (though I probably don't have time
> to apply patches and rebuild the JDK myself)
>
>
> Can you send the set of CMS flags you use? It might tell
> us something about the GC behavior of your application.
> Might not tell us anything but it's worth a look.
>
>
> Different customers have found different flags to work well for them.
> One user uses the following:
>
>
> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>
>
> Another uses:
>
>
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
> -XX:NewSize=64m -XX:MaxNewSize=64m
> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>
>
>
>
>
>
>
> The particular tuning options probably depend on the actual cache
> workload of the user. I tend to recommend CMSInitiatingOccupancyFraction
> around 75 or so, since the software maintains about 60% heap usage. I
> also think a NewSize slightly larger would improve things a bit, but if
> it gets more than 256m or so, the ParNew pauses start to be too long for
> a lot of use cases.
>
> Regarding CMS logs, I can probably restart this test later this
> afternoon on CMS and run it for a couple hours, but it isn't likely to
> hit the multi-minute compaction that quickly. It happens more in the wild.
>
> -Todd
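
The reasoning behind picking CMSInitiatingOccupancyFraction relative to a ~60% live set can be sketched as simple arithmetic. Note that the fraction is a percentage of old gen occupancy rather than of the whole heap. The function name is made up, and the 8000 MB heap with a 64 MB young gen is just one combination of figures taken from the flags quoted in this thread.

```python
def cms_headroom_mb(heap_mb, young_mb, live_fraction_of_heap,
                    initiating_occupancy_pct):
    """MB between the steady-state live set and the CMS trigger point.

    Simplification: the whole live set is assumed to sit in the old gen.
    A negative result means the old gen is always above the trigger, so
    CMS would run back-to-back concurrent cycles.
    """
    old_mb = heap_mb - young_mb
    trigger_mb = old_mb * initiating_occupancy_pct / 100.0
    live_mb = heap_mb * live_fraction_of_heap
    return trigger_mb - live_mb


# 8000 MB heap, 64 MB young gen, ~60% live data.
for pct in (40, 75, 88):
    print(pct, round(cms_headroom_mb(8000, 64, 0.60, pct), 1))
```

With these inputs a setting of 75 leaves a bit over 1 GB of garbage headroom before a concurrent cycle starts, while 40 is below the live set altogether.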
___________________________________________________
On 07/07/10 08:45, Todd Lipcon wrote:
...
>
> Overnight I saw one "concurrent mode failure".
...
> 2010-07-07T07:56:27.786-0700: 28490.203: [GC 28490.203: [ParNew
> (promotion failed): 59008K->59008K(59008K), 0.0179250 secs]28490.221:
> [CMS2010-07-07T07:56:27.901-0700: 28490.317: [CMS-concurrent-preclean:
> 0.556/0.947 secs] [Times:
> user=5.76 sys=0.26, real=0.95 secs]
> (concurrent mode failure): 6359176K->4206871K(8323072K), 17.4366220
> secs] 6417373K->4206871K(8382080K), [CMS Perm : 18609K->18565K(31048K)],
> 17.4546890 secs] [Times: user=11.17 sys=0.09, real=17.45 secs]
>
> I've interpreted pauses like this as being caused by fragmentation,
> since the young gen is 64M, and the old gen here has about 2G free. If
> there's something I'm not understanding about CMS, and I can tune it
> more smartly to avoid these longer pauses, I'm happy to try.
Yes the old gen must be fragmented. I'll look at the data you have
made available (for CMS). The CMS log you uploaded does not have the
suffix leading into the concurrent mode failure you display above
(it stops less than 2500 s into the run). If you could include
the entire log leading into the concurrent mode failures, it would
be a great help. Do you have large arrays in your
application? The shape of the promotion graph for CMS is somewhat
jagged, indicating _perhaps_ that. Yes, +PrintTenuringDistribution
would shed a bit more light. As regards fragmentation, it can be
tricky to tune against, but we can try once we understand a bit
more about the object sizes and demographics.
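
The fragmentation reading of that log entry can be backed with a quick calculation from the numbers in it. This is only a sketch: the log line has been re-joined and trimmed to the three occupancy triples of interest, and the regex is written for this format only.

```python
import re

# Re-joined excerpt of the promotion failure quoted above.
LOG = ("[ParNew (promotion failed): 59008K->59008K(59008K), 0.0179250 secs] "
       "(concurrent mode failure): 6359176K->4206871K(8323072K), "
       "17.4366220 secs] 6417373K->4206871K(8382080K)")

# Each triple is before->after(capacity), all in KB.
TRIPLE = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")
triples = [tuple(int(x) for x in m) for m in TRIPLE.findall(LOG)]
(young_before, _, young_cap), (old_before, old_after, old_cap), _ = triples

free_old_kb = old_cap - old_before
print(f"old gen free before the failure: {free_old_kb / 1024:.0f} MB")
print(f"that is {free_old_kb / young_cap:.1f}x the whole young gen")
```

The old gen had close to 1.9 GB free, over thirty times the size of the entire young gen, when the promotion failed. That is hard to explain by lack of space alone and fits the fragmentation reading.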
I don't suppose you have an easily shared test case, so that we
can reproduce both the CMS fragmentation and the G1 full gc
issues locally for quickest progress on this?
-- ramki
_______________________________________________
___________________________________________________
On 07/07/10 17:18, Todd Lipcon wrote:
...
> Looking at the graph you attached, it appears that the low-water mark
> stabilizes at somewhere between 4.5G and 5G. The configuration I'm
> running is to allocate 40% of the heap to Memstore and 20% of the heap
> to the LRU cache. For an 8G heap, this is 4.8GB. So, for this
> application it's somewhat expected that, as it runs, it will accumulate
> more and more data until it reaches this threshold. The data is, of
> course, not *permanent*, but it's reasonably long-lived, so it makes
> sense to me that it should go into the old generation.
Ah, i see. In that case, i think you could try using a slightly larger old
gen. If the old gen stabilizes at 4.2 GB, we should allow as much for slop.
i.e. make the old gen 8.4 GB (or whatever is the measured stable
old gen occupancy), then add to that the young gen size, and use
that for the whole heap. I would be even more aggressive
and grant more to the old gen -- as i said earlier perhaps
double the old gen from its present size. If that doesn't work
we know that something is amiss in the way we are going at this.
If it works, we can iterate downwards from a config that we know
works, down to what may be considered an acceptable space overhead
for GC.
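
The sizing rule above (stable old gen occupancy, plus as much again for slop, plus the young gen) works out as follows. The function name and the slop_factor parameter are only illustrative, and the 64 MB young gen is taken from the CMS flags quoted earlier in the thread rather than from this message.

```python
def suggested_heap_gb(stable_old_gb, young_gb, slop_factor=2.0):
    """Return (old gen size, total heap size) in GB.

    slop_factor=2.0 means "allow as much again for slop"; the idea is to
    start from a setting that works and iterate downwards from there.
    """
    old_gb = stable_old_gb * slop_factor
    return old_gb, old_gb + young_gb


# 4.2 GB stable old gen occupancy, 64 MB young gen.
old_gb, heap_gb = suggested_heap_gb(4.2, 64 / 1024)
print(f"old gen {old_gb:.1f} GB, total heap {heap_gb:.2f} GB")

# Stepping down towards an acceptable space overhead.
for factor in (2.0, 1.8, 1.6):
    print(factor, round(suggested_heap_gb(4.2, 64 / 1024, factor)[1], 2))
```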
>
> If you like, I can tune down those percentages to 20/20 instead of
> 20/40, and I think we'll see the same pattern, just stabilized around
> 3.2GB. This will probably delay the full GCs, but still eventually hit
> them. It's also way lower than we can really go - customers won't like
> "throwing away" 60% of the allocated heap to GC!
I understand that sentiment. I want us to get to a state where we are able
to completely avoid the creeping fragmentation, if possible. There are
other ways to tune for this, but they are more labour-intensive and tricky,
and I would not want to go into that lightly. You might want to contact
your Java support for help with that.
>
>
> Perhaps jmap -histo:live or +PrintClassHistogram[AfterFullGC]
> will help you get to the bottom of yr leak. Once the leak is plugged
> perhaps we could come back to the G1 tuning effort? (We have some
> guesses as to what might be happening and the best G1 minds are
> chewing on the info you provided so far, for which thanks!)
>
>
> I can try running with those options and see what I see, but I've
> already spent some time looking at heap dumps, and not found any leaks,
> so I'm pretty sure it's not the issue.
OK, in that case it's not worth doing, since you've already ruled
out leaks.
I'll think some more about this meanwhile.
thanks.
-- ramki
_______________________________________________
___________________________________________________
On 07/07/10 17:32, Todd Lipcon wrote:
...
> OK, I can try some tests with cache configured for only 40% heap usage.
> Should I run these tests with CMS or G1?
I'd first try CMS, and if that works, try G1.
>
>
>
>
> If you like, I can tune down those percentages to 20/20 instead
> of 20/40, and I think we'll see the same pattern, just
> stabilized around 3.2GB. This will probably delay the full GCs,
> but still eventually hit them. It's also way lower than we can
> really go - customers won't like "throwing away" 60% of the
> allocated heap to GC!
>
>
> I understand that sentiment. I want us to get to a state where we
> are able
> to completely avoid the creeping fragmentation, if possible. There are
> other ways to tune for this, but they are more labour-intensive and
> tricky,
> and I would not want to go into that lightly. You might want to contact
> your Java support for help with that.
>
>
> Yep, we've considered various solutions involving managing our own
> ref-counted slices of a single pre-allocated byte array - essentially
> writing our own slab allocator. In theory this should make all of the
> GCable objects constrained to a small number of sizes, and thus prevent
> fragmentation, but it's quite a project to undertake :)
That would be overdoing it. I didn't mean anything so drastic and certainly
nothing so drastic at the application level. When I said "labour intensive"
I meant tuning GC to avoid that kind of fragmentation would be more work.
>
> Regarding Java support, as an open source project we have no such
> luxury. Projects like HBase and Hadoop, though, are pretty visible to
> users as "big Java apps", so getting them working well on the GC front
> does good things for Java adoption in the database/distributed systems
> community, I think.
I agree, and we certainly should.
-- ramki
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> ------------------------------------------------------------------------
___________________________________________________
> Am I missing some tuning that should be done for G1GC for applications like
> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
> we're generating?
I have never run HBase, but in an LRU stress test (I posted about it a
few months ago) I specifically observed remembered set scanning costs
go way up. In addition I was seeing fallbacks to full GC:s recently in
a slightly different test that I also posted about to -use, and that
turned out to be a result of the estimated rset scanning costs being
so high that regions were never selected for eviction even though they
had very little live data. I would be very interested to hear if
you're having the same problem. My last post on the topic is here:
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
Including the link to the (throw-away) patch that should tell you
whether this is what's happening:
http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
Out of personal curiosity I'd be very interested to hear whether this
is what's happening to you (in a real reasonable use-case rather than
a synthetic benchmark).
My sense (and hotspot/g1 developers please smack me around if I am
misrepresenting anything here) is that the effect I saw (with rset
scanning costs) could cause perpetual memory growth (until fallback to
full GC) in two ways:
(1) The estimated (and possibly real) cost of rset scanning for a
single region could be so high that it is never possible to select it
for eviction given the asked for pause time goals. Hence, such a
region effectively "leaks" until full GC.
(2) The estimated (and possibly real) cost of rset scanning for
regions may be so high that there are, in practice, always other
regions selected for high pay-off/cost ratios, such that they end up
never being collected even if theoretically a single region could be
evicted within the pause time goal.
These are effectively the same thing, with (1) being an extreme case of (2).
In both cases, the effect should be mitigated (and has been in the
case where I did my testing), but as far as I can tell not generally
"fixed", by increasing the pause time goals.
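
As a rough illustration of cases (1) and (2) above, here is a minimal Python sketch of a greedy collection-set selection under a pause budget. It is not G1's actual algorithm: the `Region` fields, the efficiency ordering (garbage reclaimed per predicted millisecond) and all the numbers are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    garbage_kb: int       # reclaimable data (hypothetical figure)
    rset_scan_ms: float   # predicted remembered-set scanning cost
    copy_ms: float        # predicted cost of copying the live data

    @property
    def cost_ms(self) -> float:
        return self.rset_scan_ms + self.copy_ms


def choose_collection_set(regions, budget_ms):
    """Pick regions by garbage per predicted ms until the next one no longer fits."""
    chosen = []
    remaining = budget_ms
    by_efficiency = sorted(regions, key=lambda r: r.garbage_kb / r.cost_ms, reverse=True)
    for region in by_efficiency:
        if region.cost_ms > remaining:
            break  # the next-best region is too expensive: stop selecting
        chosen.append(region.name)
        remaining -= region.cost_ms
    return chosen


regions = [
    Region("cheap-1", garbage_kb=1000, rset_scan_ms=0.3, copy_ms=0.05),
    Region("cheap-2", garbage_kb=1000, rset_scan_ms=0.3, copy_ms=0.05),
    # Almost all garbage, but with a very large remembered set.
    Region("rset-heavy", garbage_kb=1000, rset_scan_ms=45.0, copy_ms=0.05),
]

tight = choose_collection_set(regions, budget_ms=20.0)
relaxed = choose_collection_set(regions, budget_ms=50.0)
print("20 ms budget:", tight)
print("50 ms budget:", relaxed)
```

With the tight budget the rset-heavy region can never be selected, however much garbage it holds; raising the budget lets it in, which matches the observation that a higher pause time goal mitigates the effect without removing the underlying cost.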
It is unclear to me how this is intended to be handled. The original
g1 paper mentions an rset scanning thread that I suspect is intended
to do rset scanning in the background, so that regions like these
could be evicted more cheaply during the STW eviction pause. I didn't
find such a thread anywhere in the source code, but I may very well
just be missing it.
--
/ Peter Schuller
_______________________________________________
Hi Peter --
Yes, my guess was also that something (possibly along the lines
you stated below) was preventing the selection of certain (sets
of) regions for evacuation on a regular basis ... I am told there
are flags that will allow you to get verbose details on what is
or is not selected for inclusion in the collection set; perhaps
that will help you get down to the bottom of this. Did you say
you had a test case that showed this behaviour? Filing a bug
with that test case may be the quickest way to get this before
the right set of eyes. Over to the G1 cognoscenti.
-- ramki
On 07/12/10 09:02, Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
>
> I have never run HBase, but in an LRU stress test (I posted about it a
> few months ago) I specifically observed remembered set scanning costs
> go way up. In addition I was seeing fallbacks to full GC:s recently in
> a slightly different test that I also posted about to -use, and that
> turned out to be a result of the estimated rset scanning costs being
> so high that regions were never selected for eviction even though they
> had very little live data. I would be very interested to hear if
> you're having the same problem. My last post on the topic is here:
>
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
>
> Including the link to the (throw-away) patch that should tell you
> whether this is what's happening:
>
> http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>
> Out of personal curiosity I'd be very interested to hear whether this
> is what's happening to you (in a real reasonable use-case rather than
> a synthetic benchmark).
>
> My sense (and hotspot/g1 developers please smack me around if I am
> misrepresenting anything here) is that the effect I saw (with rset
> scanning costs) could cause perpetual memory growth (until fallback to
> full GC) in two ways:
>
> (1) The estimated (and possibly real) cost of rset scanning for a
> single region could be so high that it is never possible to select it
> for eviction given the asked for pause time goals. Hence, such a
> region effectively "leaks" until full GC.
>
> (2) The estimated (and possibly real) cost of rset scanning for
> regions may be so high that there are, in practice, always other
> regions selected for high pay-off/cost ratios, such that they end up
> never being collected even if theoretically a single region could be
> evicted within the pause time goal.
>
> These are effectively the same thing, with (1) being an extreme case of (2).
>
> In both cases, the effect should be mitigated (and has been in the
> case where I did my testing), but as far as I can tell not generally
> "fixed", by increasing the pause time goals.
>
> It is unclear to me how this is intended to be handled. The original
> g1 paper mentions an rset scanning thread that I may suspect would be
> intended to help do rset scanning in the background such that regions
> like these could be evicted more cheaply during the STW eviction
> pause; but I didn't find such a thread anywhere in the source code -
> but I may very well just be missing it.
>
_______________________________________________
Peter and Todd,
Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
sending us the log, or part of it (say between two Full GCs)? Be
prepared: this will generate piles of output. But it will give us
per-region information that might shed more light on the cause of the
issue.... thanks,
Tony, HS GC Group
Peter Schuller wrote:
>> Am I missing some tuning that should be done for G1GC for applications like
>> this? Is 20ms out of 80ms too aggressive a target for the garbage rates
>> we're generating?
>>
>
> I have never run HBase, but in an LRU stress test (I posted about it a
> few months ago) I specifically observed remembered set scanning costs
> go way up. In addition I was seeing fallbacks to full GC:s recently in
> a slightly different test that I also posted about to -use, and that
> turned out to be a result of the estimated rset scanning costs being
> so high that regions were never selected for eviction even though they
> had very little live data. I would be very interested to hear if
> you're having the same problem. My last post on the topic is here:
>
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-June/000652.html
>
> Including the link to the (throw-away) patch that should tell you
> whether this is what's happening:
>
> http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>
> Out of personal curiosity I'd be very interested to hear whether this
> is what's happening to you (in a real reasonable use-case rather than
> a synthetic benchmark).
>
> My sense (and hotspot/g1 developers please smack me around if I am
> misrepresenting anything here) is that the effect I saw (with rset
> scanning costs) could cause perpetual memory growth (until fallback to
> full GC) in two ways:
>
> (1) The estimated (and possibly real) cost of rset scanning for a
> single region could be so high that it is never possible to select it
> for eviction given the asked for pause time goals. Hence, such a
> region effectively "leaks" until full GC.
>
> (2) The estimated (and possibly real) cost of rset scanning for
> regions may be so high that there are, in practice, always other
> regions selected for high pay-off/cost ratios, such that they end up
> never being collected even if theoretically a single region could be
> evicted within the pause time goal.
>
> These are effectively the same thing, with (1) being an extreme case of (2).
>
> In both cases, the effect should be mitigated (and has been in the
> case where I did my testing), but as far as I can tell not generally
> "fixed", by increasing the pause time goals.
>
> It is unclear to me how this is intended to be handled. The original
> g1 paper mentions an rset scanning thread that I may suspect would be
> intended to help do rset scanning in the background such that regions
> like these could be evicted more cheaply during the STW eviction
> pause; but I didn't find such a thread anywhere in the source code -
> but I may very well just be missing it.
>
>
_______________________________________________
Ramki/Tony,
> Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,
So what I have in terms of data is (see footnotes for the URLs referenced in []):
(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)
(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.
(c) A file[3] with the effective VM options during the test.
(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).
(e) Nature of the test.
Discussion:
With respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the full GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)
If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:
predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time
So in the most extreme case in the excerpt above, that's > half a
second of estimated rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive, almost entirely due to rset scanning costs.
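
To make the numbers in that excerpt easier to compare, here is a small Python sketch that parses the six log lines and derives the predicted cost per card and the share of each prediction that is rset scanning. The log lines are copied from the excerpt; the regular expression and variable names are assumptions for this example and are not part of the patch.

```python
import re

LOG_LINES = [
    "predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time",
    "predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time",
    "predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time",
    "predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other time",
    "predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other time",
    "predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time",
]

# Captures: total predicted ms, rset scan ms, number of cards.
PATTERN = re.compile(r"([\d.]+)ms total, ([\d.]+)ms rs scan \((\d+) cnum\)")


def parse(line):
    match = PATTERN.search(line)
    return float(match.group(1)), float(match.group(2)), int(match.group(3))


rows = [parse(line) for line in LOG_LINES]
ms_per_card = [rs_ms / cards for _, rs_ms, cards in rows]
rs_share = [rs_ms / total_ms for total_ms, rs_ms, _ in rows]

for (total_ms, rs_ms, cards), per_card, share in zip(rows, ms_per_card, rs_share):
    print(f"{cards:>7} cards  {per_card * 1000:.4f} us/card  rset share {share:.1%}")
```

The per-card figure comes out essentially identical on every line, which suggests the prediction is linear in the card count, and rset scanning accounts for well over 95% of each predicted total.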
If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics accumulated from the marking phase, we see more
output resulting from the patch. At the end, we see:
(picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393380 KB left in heap.)
(picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393365 KB left in heap.)
(picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb marked; 15kb maxlive; 1-1% liveness)
(393349 KB left in heap.)
(no more marked regions; next region too expensive (adaptive; predicted 0.346036ms > remaining 0.279355ms))
So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there are plenty of further regions that
ideally should be collected because they contain almost no live data
(ignoring the cost of collecting them).
In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's back and we
simply exceeded the allotted time for this particular GC.
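
The budget arithmetic in that log excerpt can be replayed directly. This is a minimal Python sketch using only the figures printed above; the variable names are made up for the example.

```python
# Predicted cost (ms) of the last three regions picked, from the log excerpt.
predicted_ms = [0.345890, 0.345963, 0.346036]
# Budget remaining (ms) before the first of those three was picked.
remaining_ms = 1.317244

trace = []
for cost in predicted_ms:
    remaining_ms -= cost
    trace.append(round(remaining_ms, 6))

# The log reports the next candidate at 0.346036 ms predicted.
next_candidate_ms = 0.346036
stops_here = next_candidate_ms > remaining_ms

print("remaining after each pick:", trace)
print("next candidate too expensive:", stops_here)
```

Each "remaining" value in the log is the previous one minus the predicted cost of the region just picked, and selection ends at the first region whose prediction exceeds what is left, here by well under a tenth of a millisecond.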
However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.
By applying this part of the patch:
- (adaptive_young_list_length() &&
+ (adaptive_young_list_length() && false && // scodetodo
I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.
Reproducing it:
I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:
http://github.com/scode/httpgctest
There is a README, but the shortest possible instructions to
re-produce the test that I did:
git clone git://github.com/scode/httpgctest.git
cd httpgctest
# grab the appropriate tag, in case I change master
git checkout 20100714_1
mvn package
HTTPGCTEST_LOGGC=gc.log ./run.sh
That should start the http server; then run concurrently:
while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
And then just wait and observe.
Nature of the test:
So the test, if run as above, will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a Clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.
In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.
[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.
(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)
My first thought here is swapping. My reading is that other time is
going to be the collection set selection time plus the collection set
free time (or at least intended to be). I think (am I wrong?) that
this should be really low under normal circumstances since no "bulk"
work is done really; in particular the *per-region* cost should be
low.
If the cost of these operations *per region* ended up being predicted
to be > 40ms, I wonder if this was not due to swapping?
Additionally: As far as I can tell the estimated 'other' cost is based
on a history of the cost from previous GC:s and completely independent
of the particular region being evaluated.
Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the gc devs whether it is even remotely plausible):
* At some point in time, there was swapping happening affecting GC
operations such that the work done to gather stats and select regions
was slow (makes some sense since that should touch lots of distinct
regions and you don't need a lot of those memory accesses swapping to
accumulate quite a bit of time).
* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.
* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.
* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.
* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
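
The feedback loop in this hypothesis can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HotSpot's predictor: it uses a plain mean over the last 10 samples, a 20ms pause goal, and invented cost figures, and it only records a new sample when non-young regions are actually collected.

```python
from collections import deque

PAUSE_GOAL_MS = 20.0  # assumed pause time goal


def simulate(measured_costs, window=10):
    """Return, per GC, whether non-young regions were collected, plus the final prediction."""
    history = deque(maxlen=window)  # recent non-young "other" cost samples
    collected = []
    for cost in measured_costs:
        prediction = sum(history) / len(history) if history else 0.0
        if prediction < PAUSE_GOAL_MS:
            # Non-young regions are selected, so a new sample is recorded.
            history.append(cost)
            collected.append(True)
        else:
            # Prediction too high: nothing selected, nothing recorded.
            collected.append(False)
    final_prediction = sum(history) / len(history)
    return collected, final_prediction


# Five normal GCs, one outlier (for example during a swap storm), then normal again.
costs = [0.3] * 5 + [400.0] + [0.3] * 50
collected, final_prediction = simulate(costs)

print("non-young collected in first 6 GCs:", collected[:6])
print("non-young collected afterwards:", any(collected[6:]))
print(f"prediction stuck at {final_prediction:.1f} ms")
```

Under these assumptions a single slow sample is enough: the prediction jumps above the goal, no further samples are ever added, and the model never recovers even though every later collection would have been cheap.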
Thoughts, anyone?
--
/ Peter Schuller
_______________________________________________
> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
Btw, to test the hypothesis: When you say "consistently", are the times
in fact so consistent that they are exactly or almost exactly the same,
which would fit my proposed hypothesis that the non-young "other" time
is stuck? If the young other time is not stuck I guess one might see
some variation (I seem to get < 1 ms on my machine) but not a lot at
all in comparison to 40ms. If you're seeing variation like 40-42 all
the time, and it never decreases significantly after reaching the 40ms
range, that would be consistent with the hypothesis I believe.
--
/ Peter Schuller
_______________________________________________
> There shouldn't be any swapping during the tests - I've got RAM fairly
> carefully allocated and I believe swappiness was tuned down on those
> machines, though I will double check to be certain.
Does HBase mmap() significant amounts of memory for I/O purposes? I'm
not very familiar with HBase and a quick Googling didn't yield an
answer.
With extensive mmap():ed I/O, excessive swapping of the application
seems to be a common problem even with significant memory margins,
sometimes even with swappiness turned down to 0. I've seen it happen
under several circumstances, and based on reports on the
cassandra-user mailing list during the past couple of months it seems
I'm not alone.
To be sure I recommend checking actual swapping history (or at least
check that the absolute amount of memory swapped out is reasonable
over time).
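
One way to check swapping history on Linux is to compare the cumulative `pswpin` and `pswpout` counters (pages swapped in and out since boot) from `/proc/vmstat` at two points in time. The following Python sketch shows the idea; the helper names are hypothetical and the two snapshots are made-up sample values rather than output from any machine in this thread.

```python
def swap_counters(vmstat_text):
    """Extract the cumulative swap-in/swap-out page counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def pages_swapped_between(before, after):
    """Pages swapped in and out between two snapshots."""
    return {name: after[name] - before[name] for name in before}


# Made-up snapshots; on a real system read them with open("/proc/vmstat").read()
# before and after the period of interest.
SNAPSHOT_BEFORE = "nr_free_pages 81234\npswpin 1200\npswpout 5300\npgfault 991\n"
SNAPSHOT_AFTER = "nr_free_pages 80110\npswpin 1200\npswpout 9100\npgfault 1422\n"

delta = pages_swapped_between(
    swap_counters(SNAPSHOT_BEFORE), swap_counters(SNAPSHOT_AFTER)
)
print(delta)
```

A non-zero delta over an interval that includes a slow collection would be worth investigating before attributing the pause to the collector itself.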
> I'll try to read through your full email in detail while looking at the
> source and the G1 paper -- right now it's a bit above my head :)
Well, just to re-iterate though I have really only begun looking at it
myself and my ramblings may be completely off the mark.
> FWIW, my tests on JRockit JRRT's gcprio:deterministic collector didn't go
> much better - eventually it fell back to a full compaction which lasted 45
> seconds or so. HBase must be doing something that's really hard for GCs to
> deal with - either on the heuristics front or on the allocation pattern
> front.
Interesting. I don't know a lot about JRockit's implementation since
not a lot of information seems to be available. I did my LRU
micro-benchmark with a ~20-30 GB heap and JRockit. I could definitely
press it hard enough to cause a fallback, but that seemed to be
directly as a result of high allocation rates simply exceeding the
forward progress made by the GC (based on blackbox observation
anyway).
(The other problem was that the compaction pauses were never able to
complete; it seems compaction is O(n) with respect to the number of
objects being compacted, and I was unable to make it compact less than
1% per GC (because the command line option only accepted integral
percents), and with my object count the 1% was enough to hit the pause
time requirement so compaction was aborted every time. Likely this
would have poor results over time as fragmentation becomes
significant.)
Does HBase go into periodic modes of very high allocation rate, or is
it fairly constant over time? I'm thinking that perhaps the concurrent
marking is just not triggered early enough and if large bursts of
allocations happen when the heap is relatively full, that might be the
triggering factor?
--
/ Peter Schuller
_______________________________________________
> Yep, I've seen JRRT also "abort compaction" on most compactions. I couldn't
> quite figure out how to tell it that it was fine to pause more often for
> compaction, so long as each pause was short.
FWIW, I got the impression at the time (but I don't remember why; I
think I was half-guessing based on assumptions about what it does and
several iterations through the documentation) that it was
fundamentally only *able* to do compaction during the stop-the-world
pause after a concurrent mark phase. I.e., I don't think you can make
it spread the work out (but I can most definitely be wrong).
--
/ Peter Schuller
_______________________________________________
A bit more data. I did the following patch:
@@ -1560,6 +1575,19 @@
       _non_young_other_cost_per_region_ms_seq->add(non_young_other_time_ms /
                                       (double) _recorded_non_young_regions);
+    } else {
+      // no non-young gen collections - if our prediction is high enough, we would
+      // never collect non-young again, so push it back towards zero so we give it
+      // another try.
+      double predicted_other_time = predict_non_young_other_time_ms(1);
+      if (predicted_other_time > MaxGCPauseMillis/2.0) {
+        if (G1PolicyVerbose > 0) {
+          gclog_or_tty->print_cr(
+            "Predicted non-young other time %.1f is too large compared to max pause time. Weighting down.",
+            predicted_other_time);
+        }
+        _non_young_other_cost_per_region_ms_seq->add(0.0);
+      }
     }
and this mostly solved the problem described above. Now I get a full GC
every 45-50 minutes which is way improved from what it was before.
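
The effect of this kind of weighting down can be illustrated with a toy model in Python. It is a sketch under assumptions, not a model of HotSpot's real predictor: it uses a plain mean over the last 10 samples, a 20ms pause goal, invented costs, and treats "prediction below the goal" as the condition for collecting non-young regions.

```python
from collections import deque

PAUSE_GOAL_MS = 20.0  # assumed pause time goal


def simulate(measured_costs, patched, window=10):
    """Return, per GC, whether non-young regions were collected."""
    history = deque(maxlen=window)
    collected = []
    for cost in measured_costs:
        prediction = sum(history) / len(history) if history else 0.0
        if prediction < PAUSE_GOAL_MS:
            history.append(cost)  # non-young regions collected: record the cost
            collected.append(True)
        else:
            if patched and prediction > PAUSE_GOAL_MS / 2.0:
                history.append(0.0)  # weight the prediction back towards zero
            collected.append(False)
    return collected


# Five normal GCs, one outlier, then normal behaviour again.
costs = [0.3] * 5 + [400.0] + [0.3] * 30
unpatched = simulate(costs, patched=False)
patched = simulate(costs, patched=True)

print("unpatched ever resumes:", any(unpatched[6:]))
print("patched resumes at GC index:", patched.index(True, 6))
```

In this toy model the unpatched variant never collects non-young regions again after the outlier, while the patched variant resumes once enough zero samples have pushed the outlier out of the averaging window. How quickly the real predictor recovers depends on how it weights its samples, which this sketch does not attempt to reproduce.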
I still seem to be putting off GC of non-young regions too much though. I
did some analysis of the G1 log and made these graphs:
http://people.apache.org/~todd/hbase-fragmentation/g1-graphing.png
The top graph is a heat map of the number of young (pink) and
non-young (blue) regions in each collection.
The middle graph is the post-collection heap usage over time, in MB.
The bottom graph is a heat map and smoothed line graph of the number of
milliseconds spent per collection. The target in this case is 50ms.
A few interesting things:
- not sure what causes the sort of periodic striated pattern in the number
of young generation regions chosen
- most of the time no old gen regions are selected for collection at all!
Here's a graph of just old regions:
http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
- When old regions are actually selected for collection the heap usage does
drop, though elapsed time does spike over the guarantee.
So it seems like something about the heuristics isn't quite right. Thoughts?
-Todd
On Fri, Jan 21, 2011 at 11:38 AM, Todd Lipcon <> wrote:
> Hey folks,
>
> Took some time over the last day or two to follow up on this on the latest
> checkout of JDK7. I added some more instrumentation and my findings so far
> are:
>
> 1) CMS is definitely hitting a fragmentation problem. Our workload is
> pretty much guaranteed to fragment, and I don't think there's anything CMS
> can do about it - see the following graphs:
> http://people.apache.org/~todd/hbase-fragmentation/
>
> 2) G1GC is hitting full pauses because the "other" pause time predictions
> end up higher than the minimum pause length. I'm seeing the following
> sequence:
>
> - A single choose_cset operation for a non_young region takes a long time
> (unclear yet why this is happening, see below)
> - This inflates the predict_non_young_other_time_ms(1) result to a value
> greater than my pause goal
> - From then on, it doesn't collect any more non-young regions (ever!)
> because any region will be considered expensive regardless of the estimated
> rset or collection costs
> - The heap slowly fills up with non-young regions until we reach a full GC
>
> 3) So the question is why the choose_cset is taking a long time. I added
> getrusage() calls to wrap the choose_cset operation. Here's some output with
> extra logging:
>
> --> About to choose cset at 725.458
> Adding 1 young regions to the CSet
> Added 0x0000000000000001 Young Regions to CS.
> (3596288 KB left in heap.)
> (picked region; 9.948053ms predicted; 21.164738ms remaining; 2448kb
> marked; 2448kb maxlive; 59-59% liveness)
> (3593839 KB left in heap.)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (picked region; 10.493828ms predicted; 11.216685ms remaining; 2279kb
> marked; 2279kb maxlive; 55-55% liveness)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
> (3591560 KB left in heap.)
> predict_region_elapsed_time_ms: 10.346346ms total, 5.119780ms rs scan
> (13558 cnum), 5.046912 copy time (2439568 bytes), 0.179654 other time
> predict_region_elapsed_time_ms: 10.407672ms total, 5.333135ms rs scan
> (14123 cnum), 4.894882 copy time (2366080 bytes), 0.179654 other time
> (no more marked regions; next region too expensive (adaptive; predicted
> 10.407672ms > remaining 0.722857ms))
> Resource usage of choose_cset:majflt: 0 nswap: 0 nvcsw: 6 nivcsw: 0
> --> About to prepare RS scan at 725.657
>
> The resource usage line with nvcsw=6 indicates there were 6 voluntary
> context switches while choosing cset. This choose_cset operation took
> 198.9ms all in choosing non-young.
>
> So, why are there voluntary context switches while choosing cset? This
> isn't swapping -- that should show under majflt, right? My only theories
> are:
> - are any locks acquired in choose_cset?
> - maybe the gc logging itself is blocking on IO to the log file? ie the
> instrumentation itself is interfering with the algorithm?
>
>
> Regardless, I think a single lengthy choose_non_young_cset operation
> shouldn't be allowed to push the prediction above the time boundary and
> trigger this issue. Perhaps a simple workaround is that, whenever a
> collection chooses no non_young regions, it should contribute a value of 0
> to the average?
>
> I'll give this heuristic a try on my build and see if it solves the issue.
>
> -Todd
>
> On Tue, Jul 27, 2010 at 3:08 PM, Todd Lipcon <> wrote:
>
>> Hi all,
>>
>> Back from my vacation and took some time yesterday and today to build a
>> fresh JDK 7 with some additional debug printouts from Peter's patch.
>>
>> What I found was a bit different - the rset scanning estimates are low,
>> but I consistently am seeing "Other time" estimates in the >40ms range.
>> Given my pause time goal of 20ms, these estimates are I think excluding most
>> of the regions from collectability. I haven't been able to dig around yet to
>> figure out where the long estimate for "other time" is coming from - in the
>> collections logged it sometimes shows fairly high "Other" but the "Choose
>> CSet" component is very short. I'll try to add some more debug info to the
>> verbose logging and rerun some tests over the next couple of days.
>>
>> At the moment I'm giving the JRockit VM a try to see how its deterministic
>> GC stacks up against G1 and CMS.
>>
>> Thanks
>> -Todd
>>
>>
>> On Tue, Jul 13, 2010 at 5:15 PM, Peter Schuller <> wrote:
>>
>>> Ramki/Tony,
>>>
>>> > Any chance of setting +PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
>>> > sending us the log, or part of it (say between two Full GCs)? Be
>>> prepared:
>>> > this will generate piles of output. But it will give us per-region
>>> > information that might shed more light on the cause of the issue....
>>> thanks,
>>>
>>> So what I have in terms of data is (see footnotes for urls references in
>>> []):
>>>
>>> (a) A patch[1] that prints some additional information about estimated
>>> costs of region eviction, and disables the GC efficiency check that
>>> normally terminates selection of regions. (Note: This is a throw-away
>>> patch for debugging; it's not intended as a suggested change for
>>> inclusion.)
>>>
>>> (b) A log[2] showing the output of a test run I did just now, with
>>> both your flags above and my patch enabled (but without disabling the
>>> efficiency check). It shows fallback to full GC when the actual live
>>> set size is 252 MB, and the maximum heap size is 2 GB (in other words,
>>> ~ 12% liveness). An easy way to find the point of full gc is to search
>>> for the string 'full 1'.
>>>
>>> (c) A file[3] with the effective VM options during the test.
>>>
>>> (d) Instructions for how to run the test to reproduce it (I'll get to
>>> that at the end; it's simplified relative to previously).
>>>
>>> (e) Nature of the test.
>>>
>>> Discussion:
>>>
>>> With respect to region information: I originally tried it in response
>>> to your recommendation earlier, but I found I did not see the
>>> information I was after. Perhaps I was just misreading it, but I
>>> mostly just saw either 0% or 100% fullness, and never the actual
>>> liveness estimate as produced by the mark phase. In the log I am
>>> referring to in this E-Mail, you can see that the last printout of
>>> region information just before the full GC fits this pattern; I just
>>> don't see anything that looks like legitimate liveness information
>>> being printed. (I don't have time to dig back into it right now to
>>> double-check what it's printing.)
>>>
>>> If you scroll up from the point of the full gc until you find a bunch
>>> of output starting with "predict_region_elapsed_time_ms" you see some
>>> output resulting from the patch, with pretty extreme values such as:
>>>
>>> predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
>>> (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
>>> (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
>>> (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
>>> scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
>>> time
>>> predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
>>> scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
>>> time
>>> predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
>>> scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
>>> time
>>>
>>> So in the most extreme case in the excerpt above, that's > half a
>>> second of estimated rset scanning time for a single region with 914147
>>> cards to be scanned. While not all are that extreme, lots and lots of
>>> regions are very expensive and almost only due to rset scanning costs.
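
Dividing the predicted rs scan time by the card count in the excerpt above gives a nearly constant per-card cost of about 0.73 microseconds, which suggests the estimate scales linearly with the number of cards. A minimal Python sketch of that arithmetic, assuming the wrapped log lines have been rejoined onto single lines (the regex and the helper name `per_card_cost_us` are illustrative and not part of the patch):

```python
import re

# Matches the debug lines printed by the throw-away patch, after unwrapping.
LINE_RE = re.compile(
    r"predict_region_elapsed_time_ms: (?P<total>[\d.]+)ms total, "
    r"(?P<rs>[\d.]+)ms rs scan \((?P<cards>\d+) cnum\)"
)

def per_card_cost_us(line):
    """Return the predicted rs scan cost per card in microseconds, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return float(m.group("rs")) * 1000.0 / int(m.group("cards"))

# Two lines copied from the excerpt (cheapest and most expensive region).
sample = [
    "predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan "
    "(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time",
    "predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs scan "
    "(914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other time",
]
costs = [per_card_cost_us(s) for s in sample]
for c in costs:
    print(f"{c:.3f} us per card")
```

With a per-card cost this stable, the predicted time for a region is driven almost entirely by how many cards its remembered set contains.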
>>>
>>> If you scroll down a bit to the first (and ONLY) partial that happened
>>> after the statistics accumulating from the marking phase, we see more
>>> output resulting from the patch. At the end, we see:
>>>
>>> (picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>> (393380 KB left in heap.)
>>> (picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>> (393365 KB left in heap.)
>>> (picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>> (393349 KB left in heap.)
>>> (no more marked regions; next region too expensive (adaptive;
>>> predicted 0.346036ms > remaining 0.279355ms))
>>>
>>> So in other words, it picked a bunch of regions in order of "lowest
>>> hanging fruit". The *least* low hanging fruit picked still had
>>> liveness at 1%; in other words, there are plenty of further regions that
>>> ideally should be collected because they contain almost nothing but
>>> garbage (ignoring the cost of collecting them).
>>>
>>> In this case, it stopped picking regions because the next region to be
>>> picked, though cheap, was the straw that broke the camel's back and we
>>> simply exceeded the allotted time for this particular GC.
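
The "remaining" values in the log excerpt are consistent with a simple greedy loop: each picked region's predicted time is subtracted from the remaining pause budget, and selection stops when the next candidate no longer fits. A minimal sketch under that assumption (the function `pick_regions` is hypothetical and is not the HotSpot code; the fourth prediction is the rejected region from the last log line):

```python
def pick_regions(predicted_ms, budget_ms):
    """Greedily pick regions, in the given order, while they fit the budget."""
    picked = []
    remaining = budget_ms
    for cost in predicted_ms:
        if cost > remaining:
            break  # next region is too expensive for what is left of the pause
        picked.append(cost)
        remaining -= cost
    return picked, remaining

# Predictions taken from the log excerpt; 1.317244 ms is the budget shown
# on the first "picked region" line.
predictions = [0.345890, 0.345963, 0.346036, 0.346036]
picked, remaining = pick_regions(predictions, 1.317244)
print(len(picked), round(remaining, 6))
```

Running this reproduces the 0.279355 ms left over when the log reports "next region too expensive".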
>>>
>>> However, after this partial completes, it reverts back to doing just
>>> young gc:s. In other words, even though there's *plenty* of regions
>>> with very low liveness, further partials aren't happening.
>>>
>>> By applying this part of the patch:
>>>
>>> - (adaptive_young_list_length() &&
>>> + (adaptive_young_list_length() && false && // scodetodo
>>>
>>> I artificially force g1 to not fall back to doing young gc:s for
>>> efficiency reasons. When I run with that change, I don't experience
>>> the slow perpetual growth until fallback to full GC. If I remember
>>> correctly though, the rset scanning cost is in fact high, but I don't
>>> have details saved and I'm afraid I don't have time to re-run those
>>> tests right now and compare numbers.
>>>
>>> Reproducing it:
>>>
>>> I made some changes and the test case should now hopefully be easy to
>>> run assuming you have maven installed. The github project is at:
>>>
>>> http://github.com/scode/httpgctest
>>>
>>> There is a README, but the shortest possible instructions to
>>> re-produce the test that I did:
>>>
>>> git clone git://github.com/scode/httpgctest.git
>>> cd httpgctest
>>> git checkout 20100714_1 # grab from the appropriate tag, in case I change master
>>> mvn package
>>> HTTPGCTEST_LOGGC=gc.log ./run.sh
>>>
>>> That should start the http server; then run concurrently:
>>>
>>> while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
>>> curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1 ; done
>>>
>>> And then just wait and observe.
>>>
>>> Nature of the test:
>>>
>>> So the test if run as above will essentially reach a steady state of
>>> equilibrium with about 25000 pieces of data in a clojure immutable
>>> map. The result is that a significant amount of new data is being
>>> allocated, but very little writing to old regions is happening. The
>>> garbage generated is very well spread out over the entire heap because
>>> it goes through all objects and drops 10% (the ratio=0.10) for each
>>> iteration, after which it adds 25000 new items.
>>>
>>> In other words; not a lot of old gen writing, but lots of writes to
>>> the young gen referencing objects in the old gen.
>>>
>>> [1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>>> [2]
>>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
>>> [3]
>>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
>>>
>>> --
>>> / Peter Schuller
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
--
Todd Lipcon
Software Engineer, Cloudera
> - most of the time no old gen regions are selected for collection at all!
> Here's a graph of just old regions:
> http://people.apache.org/~todd/hbase-fragmentation/old-regions.png
This is consistent with my anecdotal observations as well and I
believe it is expected. What I have observed happening is that
non-young (partial) collections always happen after the marking phases
some number of times, followed by young collections only until another
marking phase is triggered and completed.
I think this makes sense because region selection is based on cost
heuristics largely based on liveness data from marking. So you have
your marking phase followed by a period of decreasing availability of
non-young regions that are eligible for collection given the GC
efficiency goals (and the pause time goals), until there are 0 such.
Young collections then continue until unrelated criteria trigger a new
marking phase, giving non-young regions a chance again to get above
the eligibility watermark.
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Posted on the Hotspot-gc-dev mailing list. Go to http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-dev to subscribe.
> I still seem to be putting off GC of non-young regions too much though. I
Part of the experiments I have been harping on was the change below, which
cuts GC efficiency out of the decision to perform non-young
collections. I'm not suggesting it actually be disabled, but perhaps
it can be adjusted to fit your workload? If there is nothing outright
wrong in terms of predictions and the problem is due to cost estimates
being too high, that may be a way to avoid full GC:s at the expense of
more expensive GC activity. This smells like something that should be
a tweakable VM option: just as GCTimeRatio affects heap expansion
decisions, there could be something to affect this (probably just a
ratio applied to the test below?).
Another thing: This is to a large part my human confirmation biased
brain speaking, but I would be really interested to find out if the
slow build-up you seem to be experiencing is indeed due to rs scan
costs caused by sparse table overflow (I've been harping about roughly
the same thing several times so maybe people are tired of it; most
recently in the thread "g1: dealing with high rates of inter-region
pointer writes").
Is your test easily runnable so that one can reproduce? Preferably
without lots of hbase/hadoop knowledge. I.e., is it something that can
be run in a self-contained fashion fairly easily?
Here's the patch indicating where to adjust the efficiency thresholding:
--- a/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp Fri Dec 17 23:32:58 2010 -0800
+++ b/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp Sun Jan 23 09:21:54 2011 +0100
@@ -1463,7 +1463,7 @@
     if ( !_last_young_gc_full ) {
       if ( _should_revert_to_full_young_gcs ||
            _known_garbage_ratio < 0.05 ||
-           (adaptive_young_list_length() &&
+           (adaptive_young_list_length() && //false && // scodetodo
            (get_gc_eff_factor() * cur_efficiency < predict_young_gc_eff())) ) {
         set_full_young_gcs(true);
       }
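
The condition touched by the patch can be restated outside the VM to experiment with the "tweakable ratio" idea. The sketch below is only a paraphrase in Python: the parameter names mirror the identifiers in the diff, `eff_ratio` is a hypothetical knob that does not exist in HotSpot, and the numbers in the usage lines are made up for illustration.

```python
def should_revert_to_young_only(should_revert, known_garbage_ratio,
                                adaptive_young_list, gc_eff_factor,
                                cur_efficiency, predicted_young_eff,
                                eff_ratio=1.0):
    """Paraphrase of the check in the diff; True means stop doing partials.

    eff_ratio is a hypothetical tuning knob: values above 1.0 make partial
    collections look more efficient, so they continue for longer.
    """
    too_inefficient = (
        adaptive_young_list
        and eff_ratio * gc_eff_factor * cur_efficiency < predicted_young_eff
    )
    return should_revert or known_garbage_ratio < 0.05 or too_inefficient

# Made-up numbers: partials are 4x less efficient than young collections.
default_decision = should_revert_to_young_only(
    False, 0.30, True, 1.0, 25.0, 100.0)
tuned_decision = should_revert_to_young_only(
    False, 0.30, True, 1.0, 25.0, 100.0, eff_ratio=5.0)
print(default_decision, tuned_decision)
```

With the default ratio the check reverts to young-only collections; with the made-up ratio of 5.0 it keeps doing partials, which is the effect the `false &&` hack forces unconditionally.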
--
/ Peter Schuller
_______________________________________________
___________________________________________________
Unfortunately my test is not easy to reproduce in its current form. But as I
look more and more into it, it looks like we're running into the same issue.
I added some code at the end of the mark phase that, after it sorts the
regions by efficiency, will print an object histogram for any regions that
are >98% garbage but very inefficient (<100KB/ms predicted collection rate).
Here's an example of an "uncollectable" region that is all garbage but for
one object:
Region 0x00002aaab0203e18 ( M1) [0x00002aaaf3800000, 0x00002aaaf3c00000]
Used: 4096K, garbage: 4095K. Eff: 6.448103 K/ms
Very low-occupancy low-efficiency region. Histogram:
num #instances #bytes class name
----------------------------------------------
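
If the "Eff" figure is reclaimable garbage divided by predicted collection time, as the K/ms unit suggests, the first region above would take roughly 635 ms to collect, far beyond a 20 ms pause goal. A small sketch that parses the stats line and applies the >98% garbage, <100 KB/ms filter described above (the regex, the function name `analyze`, and the efficiency formula are assumptions made for illustration):

```python
import re

# Matches the per-region stats line from the debug output.
STATS_RE = re.compile(
    r"Used: (?P<used>\d+)K, garbage: (?P<garbage>\d+)K\. "
    r"Eff: (?P<eff>[\d.]+) K/ms"
)

def analyze(line, min_garbage=0.98, max_eff_k_per_ms=100.0):
    """Return (garbage fraction, implied collection time in ms, flagged)."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    used = int(m.group("used"))
    garbage = int(m.group("garbage"))
    eff = float(m.group("eff"))
    garbage_frac = garbage / used
    # Assumption: efficiency = garbage / predicted time, so invert it.
    predicted_ms = garbage / eff
    flagged = garbage_frac > min_garbage and eff < max_eff_k_per_ms
    return garbage_frac, predicted_ms, flagged

frac, predicted_ms, flagged = analyze(
    "Used: 4096K, garbage: 4095K. Eff: 6.448103 K/ms")
print(f"{frac:.4f} garbage, ~{predicted_ms:.0f} ms to collect, flagged={flagged}")
```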
___________________________________________________
Added one more bit of debug output here.. whenever it comes across a region
with only one live object, it also dumps the stats on the regions in the
coarse rset. For example:
Region 0x00002aaab0c48388 ( M1) [0x00002aac7f000000, 0x00002aac7f400000]
Used: 4096K, garbage: 4095K. Eff: 30.061317 K/ms
Very low-occupancy low-efficiency region.
RSet: coarse: 90112 fine: 0 sparse: 2717
num #instances #bytes class name
----------------------------------------------
___________________________________________________