Analysis and Solution of the Online finalize() Timeout Problem
For a long time, the online availability rate of our Android app has stayed at around XX.X%. In recent months, a large number of TimeoutExceptions have shown up in the crash statistics. Ranked by the number of affected users, XX of the TopXX crashes are caused by this kind of problem.
According to the statistics, more than XX% of the crashes occur on OPPO R9 series devices, XX% occur in the background, and the vast majority occur around X minutes after the application starts. Android versions X.X and X.X together account for more than XX%, and non-rooted users account for more than XX%. Taken together, the data points to a problem that is specific to certain models/systems, is barely noticeable to users, and has a high incidence rate.
User experience is a vital part of the application, so we took the time to analyze and solve the problem.
Results
The problem has been solved. The data involves sensitive information and is hidden here.
Problem Analysis
Direct Cause
First, locate the crash site. It is in FinalizerWatchdogDaemon, an inner class of java.lang.Daemons.
private static void finalizerTimedOut(Object object) {
As you can see from the source, when finalizerTimedOut() is called it instantiates a TimeoutException, assembles the error message, attaches the stack information, and then reports it as an uncaught exception, which is the crash we saw on Bugly.
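The steps just described can be sketched in plain Java. This is a minimal illustration, not the platform source: the class name FinalizerTimeoutSketch, the helper names, and the exact message format are assumptions made for the example.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of "build the exception, borrow a stack trace, report it".
final class FinalizerTimeoutSketch {

    /** Builds a TimeoutException whose stack trace comes from the stuck thread. */
    static Exception buildTimeout(Object stuckObject, Thread finalizerThread, long timeoutSeconds) {
        // Message format is an assumption chosen for this example.
        String message = stuckObject.getClass().getName()
                + ".finalize() timed out after " + timeoutSeconds + " seconds";
        Exception synthetic = new TimeoutException(message);
        // Show where the finalizer thread is stuck, not where the watchdog runs.
        synthetic.setStackTrace(finalizerThread.getStackTrace());
        return synthetic;
    }

    /** Hands the exception to the process-wide handler; returns false if none is set. */
    static boolean report(Exception synthetic) {
        Thread.UncaughtExceptionHandler handler = Thread.getDefaultUncaughtExceptionHandler();
        if (handler == null) {
            return false;
        }
        handler.uncaughtException(Thread.currentThread(), synthetic);
        return true;
    }
}
```

Because the exception goes through the default handler, a crash reporter installed there sees it like any other uncaught exception, even though the stack trace belongs to a different thread.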
So why is this function called? Looking further through the Daemons class, the call site turns out to be in FinalizerWatchdogDaemon.
public void runInternal() {
As shown, when the watchdog finds that a finalize() call has timed out while releasing resources, it reports the exception and returns from the current function.
So what are Daemons and FinalizerWatchdogDaemon for, and in what scenarios do these errors occur?
Root Cause
The Daemons class has a pair of functions, start() and stop(). In API 28, start() is implemented as follows:
public static void start() {
As you can see, it starts four threads. If a class overrides finalize(), a new FinalizerReference pointing to the object is created whenever the class is instantiated. Once that FinalizerReference is the only thing still referencing the object, the object is considered eligible for collection by the GC and is added to the ReferenceQueue.
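The hand-off through a ReferenceQueue can be illustrated with the public java.lang.ref API. FinalizerReference itself is internal to the runtime, so this sketch uses a WeakReference and a manual enqueue() call as stand-ins for what the GC does. The class name ReferenceQueueSketch is hypothetical.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Hypothetical stand-in: WeakReference + manual enqueue() instead of
// the runtime-internal FinalizerReference and the GC.
final class ReferenceQueueSketch {

    /** Returns true if the reference appears on the queue only after it is enqueued. */
    static boolean demo() {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object tracked = new Object();
        Reference<Object> ref = new WeakReference<>(tracked, queue);

        boolean emptyBefore = queue.poll() == null; // nothing is pending yet
        boolean enqueued = ref.enqueue();           // stand-in for the GC's hand-off
        boolean sameRef = queue.poll() == ref;      // a consumer thread would now see it
        return emptyBefore && enqueued && sameRef;
    }
}
```

A daemon thread that consumes such a queue only ever sees objects that have already been handed over, which is why a slow consumer shows up as a backlog rather than as an error at allocation time.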
ReferenceQueueDaemon is the daemon thread for ReferenceQueue. When it runs, it takes the references handed over by the GC and enqueues them on their ReferenceQueue so that they can be processed in turn.
FinalizerDaemon is the finalizer daemon thread. After the GC has run, it takes the queued objects and executes their finalize() functions one by one.
FinalizerWatchdogDaemon is the finalizer watchdog daemon thread. As the name implies, it monitors the FinalizerDaemon thread. When finalization of an object times out and certain conditions are met, a TimeoutException is thrown.
HeapTrimmerDaemon is the heap-trimming daemon thread, which is used to reclaim heap memory.
For the part above that throws the exception under certain conditions, the key source code is as follows:
/**
As shown, after waiting for MAX_FINALIZE_NANOS, the watchdog checks whether finalization of the monitored object has completed. If not, it waits another NANOS_PER_SECOND / 2 and checks again. If finalization is still not complete, it returns a non-null object, which means the condition for throwing the timeout exception is satisfied.
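The two-stage check described above can be sketched as follows. This is a simplified illustration rather than the platform code: it uses milliseconds and Thread.sleep instead of the nanosecond constants, and the names TwoStageWaitSketch, timedOut, maxMillis and graceMillis are assumptions made for the example.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: wait the full timeout, then one short grace period.
final class TwoStageWaitSketch {

    /**
     * Returns true when the watched work is still unfinished after the main
     * timeout plus the grace period, i.e. when a timeout should be reported.
     */
    static boolean timedOut(BooleanSupplier finished, long maxMillis, long graceMillis) {
        try {
            Thread.sleep(maxMillis);          // first wait: the full timeout
            if (finished.getAsBoolean()) {
                return false;                 // finished in time
            }
            Thread.sleep(graceMillis);        // second wait: short grace period
            return !finished.getAsBoolean();  // still unfinished: report a timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;                     // interrupted: do not report a timeout
        }
    }
}
```

The second check exists so that work finishing just after the first deadline is not reported as a timeout.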
At this point, the crash site and its context are basically clear. So why does object finalization time out?
First, a code review ruled out custom business classes that override finalize() as the cause. The statistics show that one specific model accounts for about XX% of the cases, so our guess is that the vendor made special changes in the lower layers of that model's system. More than XX% of the exceptions occur in the background, so the problem may also be related to the Doze mechanism. The real cause remains to be investigated.
Solutions
From the above analysis, the following candidate solutions come to mind:
- Increase the timeout slightly and expect the problem to occur less often.
- Set the timeout directly to a very large value so that the condition for throwing the exception can never be satisfied.
- After the application starts, stop FinalizerWatchdogDaemon so that it no longer monitors whether finalization completes in time.
- When a timeout exception occurs, catch the exception and stop/restart FinalizerWatchdogDaemon.
Whichever solution is chosen, it should be combined with a filter for the specific models, so that the workaround is never executed on models that are not affected.
We currently use the fourth option. According to the online data, the problem no longer appears.
Because the related classes and methods are private, they have to be called through reflection. In addition, Android 9.0 and above warns about and restricts calls to private APIs, and some of the code has thread-safety issues on versions before Android 6.0 (excluding 6.0), so version checks and thread handling need to be added in the right places.
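One way to combine the model filter with the version checks is a small gate that is consulted before the workaround runs. The sketch below is hypothetical: the class name WorkaroundGate, the prefix list, and the SDK bounds are illustrative assumptions. On a device, the inputs would typically come from Build.MODEL and Build.VERSION.SDK_INT.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical gate: apply the workaround only on listed models and SDK levels.
final class WorkaroundGate {
    private final List<String> modelPrefixes; // affected model name prefixes (assumed values)
    private final int minSdk;                 // lowest SDK level to apply on, inclusive
    private final int maxSdk;                 // highest SDK level to apply on, inclusive

    WorkaroundGate(List<String> modelPrefixes, int minSdk, int maxSdk) {
        this.modelPrefixes = modelPrefixes;
        this.minSdk = minSdk;
        this.maxSdk = maxSdk;
    }

    /** Returns true only for an affected model running an SDK level inside the range. */
    boolean shouldApply(String model, int sdkInt) {
        if (model == null || sdkInt < minSdk || sdkInt > maxSdk) {
            return false;
        }
        String normalized = model.toLowerCase(Locale.ROOT);
        for (String prefix : modelPrefixes) {
            if (normalized.startsWith(prefix.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the lists and bounds as constructor arguments makes it easier to adjust them from remote configuration if more models turn out to be affected.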
In addition, because the fourth option catches the timeout exception, the catch has to happen in the application's implementation of the UncaughtExceptionHandler interface. Implementing that interface is outside the scope of this article, so it is not described in detail here.
Here is reference code from GitHub (it does not handle the multi-threading issue, so adapt it as needed).
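For orientation only, a handler that intercepts this one exception and delegates everything else could look roughly like this. It is a sketch under assumptions: the class name FinalizerTimeoutFilter and the callback are hypothetical, and matching on the thread name "FinalizerWatchdogDaemon" is an assumption about how the watchdog thread is named.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical handler: swallow finalizer timeouts, delegate everything else.
final class FinalizerTimeoutFilter implements Thread.UncaughtExceptionHandler {
    // Assumed name of the watchdog thread.
    private static final String WATCHDOG_THREAD_NAME = "FinalizerWatchdogDaemon";

    private final Thread.UncaughtExceptionHandler next; // previous handler, may be null
    private final Runnable onFinalizerTimeout;          // e.g. stop or restart the watchdog

    FinalizerTimeoutFilter(Thread.UncaughtExceptionHandler next, Runnable onFinalizerTimeout) {
        this.next = next;
        this.onFinalizerTimeout = onFinalizerTimeout;
    }

    /** True when the error is a TimeoutException raised on the watchdog thread. */
    static boolean isFinalizerTimeout(Thread thread, Throwable error) {
        return error instanceof TimeoutException
                && thread != null
                && WATCHDOG_THREAD_NAME.equals(thread.getName());
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (isFinalizerTimeout(thread, error)) {
            onFinalizerTimeout.run(); // react, then swallow instead of crashing
            return;
        }
        if (next != null) {
            next.uncaughtException(thread, error); // all other crashes are unchanged
        }
    }
}
```

It would be installed with Thread.setDefaultUncaughtExceptionHandler, passing the previously installed handler as next so that crash reporting keeps working for every other exception.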
final Class clazz = Class.forName("java.lang.Daemons$FinalizerWatchdogDaemon");
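Since only the first line of the reference code is shown here, the following is an independent, simplified sketch of the same idea and not the GitHub code. It assumes that the watchdog class exposes a static INSTANCE field and that its superclass has a stop() method. Neither is a public API, so the sketch returns false whenever anything is missing or blocked. The name WatchdogStopper is hypothetical, and the pre-6.0 thread-safety concern is not addressed.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Hypothetical sketch: stop the finalizer watchdog through reflection.
final class WatchdogStopper {

    /** Returns true if stop() was invoked, false if the class or members are unavailable. */
    static boolean stopFinalizerWatchdog() {
        try {
            Class<?> clazz = Class.forName("java.lang.Daemons$FinalizerWatchdogDaemon");
            // Assumption: the singleton is held in a static field named INSTANCE.
            Field instanceField = clazz.getDeclaredField("INSTANCE");
            instanceField.setAccessible(true);
            Object watchdog = instanceField.get(null);
            // Assumption: the daemon base class declares a no-argument stop().
            Method stop = clazz.getSuperclass().getDeclaredMethod("stop");
            stop.setAccessible(true);
            stop.invoke(watchdog);
            return true;
        } catch (Throwable t) {
            // Not on Android, members renamed, or reflection restricted: leave things alone.
            return false;
        }
    }
}
```

Failing closed matters here: on a system where the private members differ, the safest outcome is that the workaround silently does nothing.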