Opened 12 years ago

Closed 10 years ago

#390 closed defect (fixed)

new RuptureSurface distance methods are not thread safe

Reported by: Kevin Milner Owned by:
Priority: critical Milestone: OpenSHA 1.3
Component: sha Version:
Keywords: Cc:

Description

Peter - this may definitely affect your current work.

I've been getting weird (non-reproducible) exceptions while doing threaded hazard calculations (using a single ERF object, and an IMR object for each thread). I traced it down to the rupture distance methods not being thread safe. They use cached values to avoid recalculating distances for the same site, but are not synchronized. For example

(from AbstractEvenlyGriddedSurface?.java)

public double getDistanceRup(Location siteLoc){
	if(!siteLocForDistCalcs.equals(siteLoc)) {
		siteLocForDistCalcs = siteLoc;
		setPropagationDistances();
	}
	return distanceRup;
}

Ned - in case you don't understand the issue here, consider this scenario: Imagine that thread1 calls getDistanceRup(loc1) and loc1 is already in the cache. It will first check to see if it is match, which it is. But then thread2 could come along and call the same method with another location (loc2), which resets the propogation distances. This could all happen before thread1 gets to the "return distanceRup" statement and it could erroneously return the distance from loc2 instead of loc1. This is very dangerous (and is already causing a problem with multithreaded calculations)

I see a couple ways of dealing with this:

  1. slap the "synchronized" keyword on all such methods. this quickly solves the problem, but removes any efficiencies from the caching in a multithreaded environment as different threads will be constantly resetting the cached value.
  2. use a cache that can store multiple values - maybe remembering the last 10 sites where distances were calculated. this could also be tuned to the number of available CPUs, so if your computer has 24 cores (as some compute nodes that I use do), the cache size would default to 24. You could also specify this as java property value (-DRuptureSurface.cacheSize=24 for example).
  3. similar to (2) above, but instead of implementing it in each class, first strip out all caching from individual rupture surfaces. Then create either a wrapper class which handles caching, or an AbstractRuptureSurface? class which all RuptureSurface? instances would extend which would handle caching.

This is very important and should be dealt with soon. Thoughts?

Change History (6)

comment:1 Changed 12 years ago by Kevin Milner

I've implemented (1) above (in [8865]) as a safety measure for now by adding synchronized to all of the distance methods on:

I don't see this as a long term solution as is hurts our parallelism - just a short term band-aid to prevent bad results.

comment:2 Changed 12 years ago by Peter Powers

Good catch. I was just mentioning threading to Ned at lunch today.

I'd leave the band-aid for now. It's not the worst thing in the world.

Stepping back, I would say we've never really benchmarked the distance calculations relative to an entire hazard calculation. At least I haven't. The caching mechanism really only caters to multi-IMR calculations and its conceivable there are much greater gains to be made by making the distance calculation implementations faster. It's also funny that this is coming up in the context of multi-threaded multi-node calculations; we should make something that's stable in a multithreaded environment, which is providing it's own efficiencies, and only then look for more granular optimizations. Just thinking out loud here on a Friday.

comment:3 Changed 12 years ago by Kevin Milner

This caching will potentially have big benefits in UCERF3 though, even for single IMR calculations. For UCERF3 we'll have hundreds of thousands of ruptures, but only a couple thousand sub section surfaces. So you'll only have to calculate the distance to an individual sub section once as opposed to thousands of times (as many ruptures might use that sub section).

comment:4 Changed 12 years ago by Peter Powers

Assuming we don't end up floating smaller ruptures on the sub-sections, I've got a magic bullet.

comment:5 Changed 12 years ago by Ned Field

So what's the "magic bullet"?

Lets not do anything extreme until we have a satisfactory set of Junit tests (much of our code was simply not written with threading in mind).

comment:6 Changed 10 years ago by Kevin Milner

Resolution: fixed
Status: newclosed

I believe all of these issues have been resolved

Note: See TracTickets for help on using tickets.