Analysis and solution of online Finalize() timeout problem

For a long time in the past, the online availability rate of the Android app has remained at around XX.X%. In recent months, a large number of TimeoutExceptions have been found in the statistics. In order to affect the number of users, XX of TopXX’s crashes are caused by similar problems.

From the statistics of the problem, more than XX% appear in the OPPO R9 series system, XX% occur in the background, the vast majority occurred before and after the application started X minutes, Android system version X.X and X.X accounted for And more than XX%, the proportion of non-Root users exceeds XX%. Based on the various data, this is a problem that occurs in a specific model/system, which is less sensible to users and has a higher incidence rate.

The user experience is a vital part of the application, so taking the time to analyze and solve the problem.

Results

The problem has been solved. The data involves sensitive information and is hidden here.

Problem Analysis

Direct Reason

First find the crime scene, located in the FinalizerWatchdogDaemon of the java.lang.Daemons class.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
private static void finalizerTimedOut(Object object) {
// The current object has exceeded the finalization deadline; abort!
String message = object.getClass().getName() + ".finalize() timed out after "
+ (MAX_FINALIZE_NANOS / NANOS_PER_SECOND) + " seconds";
Exception syntheticException = new TimeoutException(message);
// We use the stack from where finalize() was running to show where it was stuck.
syntheticException.setStackTrace(FinalizerDaemon.INSTANCE.getStackTrace());

// Send SIGQUIT to get native stack traces.
try {
Os.kill(Os.getpid(), OsConstants.SIGQUIT);
// Sleep a few seconds to let the stack traces print.
Thread.sleep(5000);
} catch (Exception e) {
System.logE("failed to send SIGQUIT", e);
} catch (OutOfMemoryError ignored) {
// May occur while trying to allocate the exception.
}

// Ideally, we'd want to do this if this Thread had no handler to dispatch to.
// Unfortunately, it's extremely to messy to query whether a given Thread has *some*
// handler to dispatch to, either via a handler set on itself, via its ThreadGroup
// object or via the defaultUncaughtExceptionHandler.
//
// As an approximation, we log by hand an exit if there's no pre-exception handler nor
// a default uncaught exception handler.
//
// Note that this condition will only ever be hit by ART host tests and standalone
// dalvikvm invocations. All zygote forked process *will* have a pre-handler set
// in RuntimeInit and they cannot subsequently override it.
if (Thread.getUncaughtExceptionPreHandler() == null &&
Thread.getDefaultUncaughtExceptionHandler() == null) {
// If we have no handler, log and exit.
System.logE(message, syntheticException);
System.exit(2);
}

// Otherwise call the handler to do crash reporting.
// We don't just throw because we're not the thread that
// timed out; we're the thread that detected it.
Thread.currentThread().dispatchUncaughtException(syntheticException);
}

As you can see from the source, after calling the finalizerTimedOut() function, instantiate an UncaughtException, assemble the error, get the stack information, and then throw the exception, which is the crash scene we saw on Bugly.

So why is this function called to throw an exception? Continue to look at the Deameons class, which can be found in FinalizerWatchdogDaemon.

1
2
3
4
5
6
7
8
9
10
11
12
13
@Override public void runInternal() {
while (isRunning()) {
if (!sleepUntilNeeded()) {
// We have been interrupted, need to see if this daemon has been stopped.
continue;
}
final Object finalizing = waitForFinalization();
if (finalizing != null && !VMRuntime.getRuntime().isDebuggerActive()) {
finalizerTimedOut(finalizing);
break;
}
}
}

It can be seen that when WatchDog finds that the finalize procedure releases the resource timeout, it throws an exception and terminates the current function.

So, what are the uses of Deamons, FinalizerWatchdogDaemon, and what kind of scenarios have these errors occurred?

Root Reason

The Deamons class has a pair of start() stop() functions, where the start() function is implemented in API 28 as follows:

1
2
3
4
5
6
public static void start() {
ReferenceQueueDaemon.INSTANCE.start();
FinalizerDaemon.INSTANCE.start();
FinalizerWatchdogDaemon.INSTANCE.start();
HeapTaskDaemon.INSTANCE.start();
}

You can see that it has started four threads. If a class overrides the finalize() function, a new FinalizerReference will be referenced when the class is instantiated. When the instantiated object has and only FinalizerReference references it, it can be considered The object is suitable for recycling by the GC, and the object will be added to the ReferenceQueue.

ReferenceQueueDaemon is the daemon thread of ReferenceQueue. When it executes, it pops out the objects in the queue and executes the corresponding finalize() function in turn.

FinalizerDaemon is the destructor daemon thread that executes the above finalize() procedure when the GC fires.

FinalizerWatchdogDaemon is a destructive monitoring daemon thread, as the name implies, is used to monitor the FinalizerDaemon thread. When the object recovery process times out when certain conditions are met, a TimeoutException will be thrown.

HeapTrimmerDaemon is a heap clipping daemon thread that is used to reclaim heap memory.

Look at the above, the part that throws the exception under certain conditions, the key source code is as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
/**
* Return an object that took too long to finalize or return null.
* Wait MAX_FINALIZE_NANOS. If the FinalizerDaemon took essentially the whole time
* processing a single reference, return that reference. Otherwise return null.
*/
private Object waitForFinalization() {
long startCount = FinalizerDaemon.INSTANCE.progressCounter.get();
// Avoid remembering object being finalized, so as not to keep it alive.
if (!sleepFor(MAX_FINALIZE_NANOS)) {
// Don't report possibly spurious timeout if we are interrupted.
return null;
}
if (getNeedToWork() && FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) {
// We assume that only remove() and doFinalize() may take time comparable to
// MAX_FINALIZE_NANOS.
// We observed neither the effect of the gotoSleep() nor the increment preceding a
// later wakeUp. Any remove() call by the FinalizerDaemon during our sleep
// interval must have been followed by a wakeUp call before we checked needToWork.
// But then we would have seen the counter increment. Thus there cannot have
// been such a remove() call.
// The FinalizerDaemon must not have progressed (from either the beginning or the
// last progressCounter increment) to either the next increment or gotoSleep()
// call. Thus we must have taken essentially the whole MAX_FINALIZE_NANOS in a
// single doFinalize() call. Thus it's OK to time out. finalizingObject was set
// just before the counter increment, which preceded the doFinalize call. Thus we
// are guaranteed to get the correct finalizing value below, unless doFinalize()
// just finished as we were timing out, in which case we may get null or a later
// one. In this last case, we are very likely to discard it below.
Object finalizing = FinalizerDaemon.INSTANCE.finalizingObject;
sleepFor(NANOS_PER_SECOND / 2);
// Recheck to make it even less likely we report the wrong finalizing object in
// the case which a very slow finalization just finished as we were timing out.
if (getNeedToWork()
&& FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) {
return finalizing;
}
}
return null;
}

It can be seen that after waiting for the MAX_FINALIZE_NANOS time, it is judged whether the monitored object recovery process has been completed. If it is not completed, wait for the NANOS_PER_SECOND / 2 time and judge again. If it is still not completed, return a non-empty object. , that is, the condition for throwing a timeout exception is satisfied.

At this point, the context of the crime scene has basically been clarified, so why is the problem of object recovery timeout?

First, through the code check, the exclusion is not possible when a custom class rewrites ‘finalize()` on the business. Through statistical analysis, it is found that the specific model accounts for about XX%, and the guess may be the corresponding model system. There are special changes in the bottom layer, and more than XX% of the anomalies occur in the background. The guess may be related to the doze mechanism. The real reason remains to be investigated.

Solutions

From the above analysis, it is easy to think of the following subdivision solutions:

  1. Increase the timeout slightly and expect to reduce the frequency of problems
  2. Set the timeout directly to a very large value so that the condition for throwing the exception cannot be satisfied.
  3. After the application starts, close FinalizerWatchdogDaemon and do not monitor whether the destructor process is executed correctly.
  4. When a timeout exception occurs, catch the exception and close/restart FinalizerWatchdogDaemon

Based on the above several solutions, plus the filtering process for specific models, it is guaranteed that the solution will not be executed on other models not involved.

Currently using the fourth option, from the online data, the problem no longer appears.

Because the related classes and methods are private, they need to be called by reflection. At the same time, because Android9.0 and above have warnings and restrictions on the call of private classes, and some of the code has multi-thread security issues before Android 6.0 (excluding 6.0), so you need to add version judgment and thread in the right place. deal with.

In addition, because the fourth scheme is to capture the timeout exception, it needs to be captured in the application’s UncaughtExceptionHandler interface implementation class. The implementation of this interface is not a problem described in this article, so it will not be described in detail here.

Provide a reference code from Github here (multi-threading problem, you can modify it yourself).

1
2
3
4
5
6
7
final Class clazz = Class.forName("java.lang.Daemons$FinalizerWatchdogDaemon");
final Field field = clazz.getDeclaredField("INSTANCE");
field.setAccessible(true);
final Object watchdog = field.get(null);
final Method method = clazz.getSuperclass().getDeclaredMethod("stop");
method.setAccessible(true);
method.invoke(watchdog);