Pavel, Christophe and I spent some time these last weeks chasing the memory leaks we have been seeing lately. It is a long story to tell, so this mail is divided in four parts:
1) A brief intro to weak structures and finalization in Pharo, for those that do not know,
2) A bit of history to explain what happened in pre-spur and post-spur,
3) The actual cause of the memory leak today,
4) How to avoid them in your application, and what we are going to do to prevent this in the future.
For those that need/want/prefer just the practical explanation, you can jump over 2) and just read 1) and 3).
1. A weak explanation
To clean up objects upon garbage collection, Pharo and Squeak use a finalization mechanism based on a Weak Registry. That is, if you want to execute some cleanup (like closing a file) when an object is about to be collected, you have to put your object inside the weak registry with the corresponding executor/finalizer object. The object you want to ‘track’ is held weakly by this weak registry, i.e., if the only reference to the object is from the weak registry, it becomes eligible for garbage collection. When this object is collected, a special process in the Pharo image sends #finalize to your executor object, where you implement your cleanup.
To interact with the weak registry, there are two main subscription messages:
– #add:executor: Will add an object to the registry with the executor that is sent as argument.
– #add: Will add an object to the registry, and use as executor a ‘shallow copy’ of the object.
Some conclusions to be drawn from this:
1) If the executor points strongly to the object that we want to collect, the object will never be collected. That is why the #add: message creates a copy of the object.
2) If we do not provide an explicit executor, the registered object should already contain all the information required for the finalization (like file handles or external pointers). If not, the shallow copy will not be able to finalize correctly.
– Using weak objects/references does not guarantee that #finalize will be called: you need to put your object inside the registry!
– Using weak objects/references does not guarantee that your object will be magically collected. You can still cause memory leaks!
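As a minimal sketch of the two subscription messages above: the classes MyHandle and MyHandleCloser, and their accessors #handle and #handle:, are hypothetical names made up for illustration. Only `WeakRegistry default`, #add:, #add:executor: and #finalize belong to the mechanism described here.

```smalltalk
"Assumptions (hypothetical names): MyHandle is the tracked class, with an accessor
 #handle answering an external resource that understands #close. MyHandleCloser is
 the executor class, with one instance variable 'handle' and a setter #handle:."

"The executor keeps only the raw resource, never the tracked object itself."
MyHandleCloser >> finalize
    handle ifNotNil: [ :h | h close ]

"Script part, e.g. evaluated in a playground."
| tracked other |

"Explicit executor: 'tracked' stays collectable because nothing in the
 executor points back to it."
tracked := MyHandle new.
WeakRegistry default
    add: tracked
    executor: (MyHandleCloser new handle: tracked handle; yourself).

"Implicit executor: a shallow copy of the object receives #finalize, so
 MyHandle must implement #finalize using only its own instance variables."
other := MyHandle new.
WeakRegistry default add: other.
```

If the executor stored `tracked` itself instead of `tracked handle`, the registry would reach the tracked object through a strong path and it would never be collected, which is conclusion 1) above.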
2. A weak story
Pharo and Squeak have historically used the weak registry mentioned above. Because of the limitations we mentioned, a different kind of weak structure called Ephemerons is required, or at least more useful. To overcome some of these limitations, Igor (Hi Igor! maybe you’re reading :)) implemented a couple of years ago a new finalization mechanism that, IIANM, worked as follows:
– Some weak objects could have a first instance variable with a special linked list
– When the object was about to be collected, instead it was removed from the weak structure and put into its container’s linked list
– On the image side, a special process iterated all special linked lists and executed #finalize on the weak objects.
This mechanism was called NewFinalization, in contrast with what was called LegacyFinalization. Of course these names are context dependent, since today’s Pharo is back to the so-called legacy one ;). NewFinalization was implemented as the default finalization mechanism in Pharo, on both the VM and the image side. But the VM changes remained in the Pharo branch of development. After some discussions, I remember Igor and Eliot agreed that what they actually needed were Ephemerons, and since Eliot had started working on Spur at that time, he said he would provide Ephemeric classes with the new object format.
Basically, for those interested, an ephemeron is an association
weak key -> strong value
with the special quality that upon garbage collection all references to the weak key that are computed from the strong value (directly or indirectly) are taken as weak. This allows the collection of the weak key even if the strong value points to it, but requires some more machinery in the GC/VM. You can read more in here .
Until a couple of months/weeks ago, Pharo was using the NewFinalization mechanism with its special image and VM support. And Squeak was using the ‘Legacy’ one. And then Spur arrived.
So Spur arrived, and Eliot and Esteban made a lot of effort to simplify the VM’s maintenance, and they merged both branches. As a consequence, the Pharo Spur VM no longer supported NewFinalization. This provoked at first some leaks because objects were not being finalized. A couple of weeks ago, we migrated the image code back to the ‘Legacy’ mechanism, see issue 17537.
And then finalization was not working either. Neither was #finalize being called on executors, nor were objects in the weak registry being collected. As a symptom, opening any tool caused 30 new everlasting registrations in the weak registry, and no tools were collected.
3. The cause
After lots of digging, we finally found the particular issue that was preventing objects in the weak registry from being collected. In short, it is caused by the common belief that “weak objects are magical”, which has led to weak references and finalizers being spread all over the system without proper care. It is particularly related to the usage of announcements.
To explain better, I made some pictures for you 🙂
***First, imagine you have a morph with its own local announcer. You subscribe to two events, and the graph will look like this.
– the announcer knows two strong subscriptions
– the subscriptions know the announcer to be able to unregister
– the subscriptions know the registered object to send the message in case the event happens
This forms a closed graph that will be collected. No problem so far.
***Second, let’s see what happens if we use weak subscriptions:
– the announcer knows two weak subscriptions
– these weak subscriptions know the announcer strongly to be able to unregister
– they also know the subscriber object but weakly
– THE difference is made by the weak registry: a global object that manages when and how objects are finalized. In the case of announcers, the weak registry will store weakly the subscriber morph, and strongly the weak announcer subscription.
So far so good also: the references to the morph are weak. When the morph is collected, the weak registry will send #finalize to the announcement subscriptions, and the subscriptions will unregister themselves from the morph’s announcer.
***The really problematic case is the third one: mixing weak and strong subscriptions in the same announcer.
The object graph is just a mixture of the two other ones. One weak subscription and one strong subscription. BUT:
– there is a strong path from a global object (the weak registry) to the subscriber (the morph)
– then the morph is never collected
– the weak registry never finalizes the weak announcement subscription
– the graph remains there forever.
And these are the simple cases that show the problem. Imagine that you can have this same configuration but in cycles/chains among different morphs/announcements. Plus, this is aggravated by evil globals (e.g., the theme and the HandMorph remember the last focused morph, the system window class remembers the last top window even if it was closed…).
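The third configuration can be sketched as follows. This is an illustration under assumptions, not the code of any actual tool: MyAnnouncement is a hypothetical Announcement subclass, and a standalone `Announcer new` stands in for the morph’s local announcer.

```smalltalk
"Assumption: MyAnnouncement is a hypothetical subclass of Announcement."
| morph announcer |
morph := Morph new.
announcer := Announcer new.   "stands in for the morph's local announcer"

"Strong subscription: announcer -> subscription -> morph, all strong references."
announcer when: MyAnnouncement send: #changed to: morph.

"Weak subscription on the SAME announcer: the weak registry holds this
 subscription strongly, and the subscription knows the announcer strongly."
announcer weak when: MyAnnouncement send: #changed to: morph.

"Resulting strong path from a global:
   weak registry -> weak subscription -> announcer -> strong subscription -> morph
 so 'morph' stays reachable and the weak subscription is never finalized."
```

Removing either of the two subscriptions breaks that path: with only the strong one there is no global root, and with only the weak one every reference to the morph is weak.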
4. The solution?
Our solution for the moment is simple. We would like to enforce the following two rules for announcements:
– announcers local to a morph should only be used strongly. YES, this may cause small hiccups and leaks, for example if you register a morph A to the announcer of another morph B. But in the long term, these two will form a closed graph and will be collected.
– announcers used globally, such as the System announcer, should only ever be used in a weak manner. That way we ensure that they are loosely coupled for real.
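A sketch of what the two rules look like at subscription time. The variables `aMorph`, `localAnnouncer` and `aTool`, the announcement class MyAnnouncement and the selector #classAdded: on the tool are assumptions made for the example; ClassAdded is used here only as one example of a system announcement.

```smalltalk
"Rule 1: announcer local to a morph -> strong subscription only.
 (localAnnouncer, aMorph and MyAnnouncement are hypothetical.)"
localAnnouncer when: MyAnnouncement send: #changed to: aMorph.

"Rule 2: global announcer -> weak subscription only.
 (aTool is hypothetical and is assumed to implement #classAdded:.)"
SystemAnnouncer uniqueInstance weak
    when: ClassAdded
    send: #classAdded:
    to: aTool.

"Independently of the rules, a subscriber can still unregister explicitly
 when it knows it is done, e.g. when a window is closed."
SystemAnnouncer uniqueInstance unsubscribe: aTool.
```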
So, please, please, do not use weak announcements unless you’re really sure of what you’re doing. At least, until we have ephemerons and we are sure everything works as expected. Ephemerons would solve this in a more natural way: if we model the weak registry subscription as an ephemeron, any reference to the weak #key that arrives from the #value will be treated as weak also.
Other action points we are working on:
– fixing tools to follow the rules above
– writing tests to check that tools (gt*, Nautilus, Rubric, FT) do not leak
– chasing other small memory leaks created by stepping, focus global variables…
((fogbugz allIssues select: [ :each | each relatedToLeak ])
flatCollect: [ :each | each participants ])