
Thunking closures

In my last post I discussed a compile-time technique called lambda-lifting that can be used to eliminate free variables from closures and transform them into ordinary functions. The purpose of lambda-lifting is to avoid the need for the runtime environment to support closures as first-class objects.

In order to be effective this technique depends on the compiler being able to identify all closure call-sites and then perform function re-writing as necessary.  It turns out that it may not always be possible to meet this requirement, so in today’s post I want to discuss a method for managing closures at runtime, using a technique called thunking.

This is a longish post, and it goes into considerable detail for the simple reason that I have not found much of this information to be generally available elsewhere on the net – and certainly not all in one place. Add to this the fact that I’ve had to figure out quite a bit of what follows for myself, which has been painful at times, and I don’t see why I should be the only one to suffer. My goal is simple: by the time you finish reading this, you’ll know pretty much all the important stuff that there is to know about implementing thunks in compilers. Whether you’ll thank me or not once you know is a different matter.

Give up the Thunk

In VSL whenever a function is invoked that returns a closure, a reference to a thunk is returned rather than a reference to the actual closure. The thunk is a proxy for the closure: when invoked it performs run-time lambda-lifting of the closure’s free variables and then hands over to the closure itself. Thunks are needed in VSL because in general the compiler cannot reliably distinguish between calls to closures and calls to ordinary functions. This means that functions of both types must be invoked the same way, so the complexity of managing calls to closures can be encapsulated in the thunk rather than in the code at each function call-site.
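The uniform call-site idea is easy to see in a high-level setting. Here is a minimal Python sketch (the names are mine, not VSL’s): an ordinary function and a closure returned by a function are invoked with exactly the same syntax, so the caller needn’t know – or care – which it has.

```python
def square(y):
    return y * y              # ordinary function: no free variables

def product(x):
    return lambda y: x * y    # returns a closure over x

# uniform call-sites: the caller cannot tell function from closure
for fn in (square, product(3)):
    print(fn(4))              # prints 16, then 12
```

In VSL the thunk is what buys this uniformity in compiled code: the call-site stays identical, and the closure bookkeeping happens behind the returned reference.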

(Talking about) My Generation

A VSL thunk consists of two parts: a 10-byte header section containing executable code, followed by an environment section containing some data: the address of the target function (closure), together with the number and values of its free variables. When invoked, a thunk uses the environment section to set up the stack for the target function and then dispatches execution to it. For this to happen, the CPU must execute some instructions. However because thunks are created at runtime, those instructions must be dynamically generated – they cannot be pre-compiled. Thankfully, most of the required work can be delegated to a helper function that can be pre-compiled, so we only need to generate a couple of instructions in the thunk itself.

I will use the following simple VSL function to illustrate the actual compiler output for a function that returns a closure.

product(x) = { lambda(y) = { x * y } }

First, the anonymous function (denoted by lambda) is compiled. The VSL compiler performs in-place compilation of anonymous functions; it does not hoist them into higher lexical scopes. The compiler assigns dummy stack addresses for all the free variables the lambda uses, in the order they are encountered. In our example lambda’s bound variable ‘y’ would be located at the address [ebp + 4], so the free variable ‘x’ is assigned to the address [ebp + 8]. If there were more free variables they would be assigned addresses [ebp + 12], [ebp + 16], [ebp + 20] and so on.

(Incidentally, the ability to assign the addresses of free variables without needing to modify the addresses of existing bound variables is the single major advantage of calling conventions where stack arguments are pushed from right to left.)
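As a quick sanity check, the address assignment rule above can be written down in a couple of lines (a sketch of my own, not compiler output):

```python
def free_var_address(index):
    # bound variable y occupies [ebp + 4]; free variables are assigned
    # successive 4-byte slots above it, in the order they are encountered
    return 8 + 4 * index

print([free_var_address(i) for i in range(4)])  # [8, 12, 16, 20]
```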

Here’s the first part of the compiler output:

product PROC
    push ebp
    mov  ebp, esp
    jmp  _$L2                  ; skip over the inline lambda
_$L1:
        push ebp
        mov  ebp, esp          ; save stack frame
        mov  eax, [ebp + 8]    ; free variable x
        xor  edx, edx
        mul  dword ptr [ebp + 4] ; bound variable y
        mov  esp, ebp
        pop  ebp               ; restore stack frame
        ret  4                 ; return to caller
_$L2:

Having compiled the inner function, we are now ready to generate code for the thunk itself. We typically only need a small amount of memory per thunk. In this case, because there is one free variable in the closure we obtain 18 + (1 * 4) i.e. 22 bytes from the global heap. The allocated memory pointer (the address of the new thunk) is returned in register eax. For the sake of argument we will assume this register points to absolute address 400000. Now we write the header op-codes. (I have included the quoted comments to show how these op-codes would be interpreted by a disassembler)

    mov   ecx, offset __t_dsp           ; helper function
    sub   ecx, eax                      ; make the address relative to the thunk...
    sub   ecx, 10                       ; ...and to the end of the 10-byte header
    mov   byte ptr  [eax + 0], 0B8h     ; "mov eax, ..."
    mov   dword ptr [eax + 1], eax      ; "... 400000"
    mov   byte ptr  [eax + 5], 0E9h     ; "jmp ..."
    mov   dword ptr [eax + 6], ecx      ; "... __t_dsp"

Got my Mojo working

So far so good, but there’s a bit of voodoo going on here with the manipulation of the value in register ecx. This register contains the address of the helper function __t_dsp that the thunk is going to delegate to for most of the hard work of calling the closure. So why do we change its value? The answer is down to a feature of the Intel x86 instruction set: the CPU expects the op-code 0E9h to be followed not by an absolute address but by a displacement relative to the address of the instruction that follows the jump. Normally the assembler/linker would have managed all this jiggery-pokery, but we are generating op-codes at runtime, so we must fix up the jump address ourselves. In essence, subtracting the thunk’s base address and then the 10 bytes of code we have dynamically assembled into memory – up to and including the jump instruction itself – converts the absolute address of __t_dsp into the displacement the CPU requires. If none of this makes any sense to you, don’t worry: it took me a long while of laboriously working it all out with pen and paper before I finally figured out why my code kept jumping into hyperspace instead of to my target function.
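To make the fix-up concrete, here is the displacement arithmetic in Python (the addresses are hypothetical, in the spirit of the 400000 example above):

```python
def rel32_displacement(thunk_base, target):
    # E9 takes a displacement relative to the address of the *next*
    # instruction, which here is the end of the 10-byte thunk header
    return (target - (thunk_base + 10)) & 0xFFFFFFFF

# with the thunk at 400000h and the helper just above it at 400100h:
print(hex(rel32_displacement(0x400000, 0x400100)))  # 0xf6
```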

Anyway, we are now ready to write the target function’s environment into the rest of the thunk:

    mov   edi, offset _$L1              ; address of lambda
    mov   dword ptr [eax + 10], 01h     ; number of free variables
    mov   dword ptr [eax + 14], edi     ; copy function address into thunk
    mov   ebx, [ebp + 8]                ; value of ‘x’
    mov   dword ptr [eax + 18], ebx     ; copied into thunk

The thunk’s address is still in register eax, so we simply finish compilation of the function in the normal way:

    mov  esp, ebp
    pop  ebp                            ; restore stack frame
    ret 4                               ; return to caller
product ENDP

Mother’s Little Helper

As noted earlier, the thunk delegates to a helper function. This helper function unpacks the closure’s environment, which the thunk has thoughtfully referenced in eax thanks to the magic of dynamic code generation. It pushes all the free variables from right to left onto the stack, sets the return address so that the target function will return to the original call-site, and then jumps to the target function. It is pretty simple, but here it is for completeness’ sake:

    pop      edi                        ; save the return address of caller
    lea      esi, [eax + 10]            ; esi -> free variable count
    mov      ecx, dword ptr [esi]       ; ecx = number of free variables
    lea      eax, dword ptr [esi + ecx * 4 + 4]  ; eax -> last free variable
@@: test     ecx, ecx
    jz       @F
    push     dword ptr [eax]            ; push free variables right-to-left
    sub      eax, 4
    dec      ecx
    jmp      @B
@@: push     edi                        ; set caller’s return address
    jmp      dword ptr [eax]            ; eax -> closure address: jump to it
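For readers who would rather see the control flow without the assembler, here is a Python model of the thunk/helper pair. The list layout mirrors the environment section (count, target address, free variable values); the names are my own.

```python
def t_dsp(env, bound_args):
    # mirrors __t_dsp: read the count, gather the free variables
    # (conceptually pushed right-to-left), then hand over to the closure
    count, target = env[0], env[1]
    free_vars = env[2:2 + count]
    return target(*bound_args, *free_vars)

def make_thunk(target, free_vars):
    # mirrors the generated thunk: capture the environment at creation,
    # delegate every call to the helper
    env = [len(free_vars), target] + list(free_vars)
    return lambda *bound_args: t_dsp(env, bound_args)

def lambda_body(y, x):      # the compiled lambda: y bound, x lifted
    return x * y

thunk = make_thunk(lambda_body, [7])
print(thunk(6))  # 42
```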

Into the Dragon’s Den

And that, said John, is that. Apart from the pitfalls, which are multiform…

Calling a function through a thunk slows the call down significantly when compared with calling the function directly. For this reason, creating lots of closures for trivial functions is not a good idea if high performance is a requirement of your program. Your thunking overhead should be small compared to the execution time of your closure.

Another thing to bear in mind is that thunks are heap objects whose lifetime is not determinable by the compiler. If you haven’t needed garbage collection so far, you do now.

Last but not least, thunking requires that your program execute code on the heap, where data is normally expected to be. This is something that computer viruses also love to do. Some virus-checkers scan for instructions in programs that create op-codes and so may flag your program as malware. A number of Unix variants make heap space non-executable by default, while some runtime libraries require you to explicitly turn execution-protection off when obtaining memory that you want to write executable code into. Recent versions of Microsoft Windows have a built-in service called Data Execution Prevention (DEP) which terminates programs that attempt to execute code from heap memory. You will need to exclude your program from DEP, or else allocate your thunk memory with execute permission explicitly enabled.

All in all then, getting thunks to work can be quite a challenge, but I hope this post has helped to illuminate the dark art for you and given you a practical appreciation of what’s involved.


Lambda lifting of closures

newton_sqrt(n) = {
    lambda(x) = {
        ?(x * x <= n and n < (x+1) * (x+1)) -> x
        self((x + (n / x)) / 2)

This short snippet of VSL defines a function which returns an anonymous function (lambda).  If invoked, lambda will calculate an integer square root given an initial guess, using the Newton-Raphson method of square-root approximation. The better the initial guess, the faster lambda will converge on an answer.

Note that the body of lambda refers to the variable n (the number whose square root is to be found) even though n is not defined in lambda nor passed as a parameter to it. In the language of Computer Science,  lambda is a closure,  and n is a free variable that it closes over.  Simply put, because lambda is defined within the lexical scope of newton_sqrt, it has access to newton_sqrt‘s bound variable n. If this still isn’t entirely clear, a simple example should help:

f = newton_sqrt(612)	; f = a function to calculate root(612)
g = newton_sqrt(3033)	; g = a function to calculate root(3033)
f(20)     ; calls root(612) with guess 20, returns: 24
f(30)	  ; calls root(612) with guess 30, returns: 24
g(50)     ; calls root(3033) with guess 50, returns: 55
g(99)     ; calls root(3033) with guess 99, returns: 55

Generating runtime code for functions with free variables is not trivial. In stack-based languages, the parameter n to newton_sqrt would normally have been passed on the stack.  But by the time lambda is invoked and the value of n is actually required,  the contents of that stack location may well have changed,  causing it to fail. This problem is not just confined to values passed on the stack: even if a closure has free variables allocated on the heap, the runtime environment must ensure that the values of such variables remain invariant during the lifetime of the closure.  (This is the reason incidentally why Java requires you to declare as final any variable referred to by an anonymous inner class method that is declared outside the inner class body. Even though alternative implementations were proposed,  the language designers opted for the simplest implementation:  force the variable to be immutable)

The fundamental problem to be solved therefore is one of preserving the environment of a function with free variables in such a way that the function can access them when invoked. One popular technique for solving this is known as lambda-lifting.

Lambda-lifting is a pure compile-time optimisation whose objective is to avoid entirely the need to carry around a function’s environment at runtime by rewriting all functions with free variables in such a way so as to eliminate them. The technique involves a number of steps, and using the above example, this is broadly how it works:

First, lambda‘s free variable n is explicitly bound:

lambda(n, x) = {
    ?(x * x <= n and n < (x+1) * (x+1)) -> x
    self((x + (n / x)) / 2)	}

Next, lambda is “lifted” out of its enclosing function, assigned a unique name, and placed at the same lexical scope as its enclosing function:

newton_sqrt(n) = {
    __newton_sqrt(n, (n + 1) / 2)

__newton_sqrt(n, x) = {
    ?(x * x <= n and n < (x+1) * (x+1)) -> x
    self((x + (n / x)) / 2)

Finally, all calls to newton_sqrt are rewritten to accept the additional ‘initial guess’ argument, while any references to the original anonymous function have their “initial guess” argument removed:

newton_sqrt(n, x) = {
    __newton_sqrt(n, x)        ; rewritten

__newton_sqrt(n, x) = {
    ?(x * x <= n and n < (x+1) * (x+1)) -> x
    self(n, (x + (n / x)) / 2)  ; rewritten

f = newton_sqrt(612, 20)        ; rewritten
g = newton_sqrt(3033, 50)       ; rewritten
f()                             ; rewritten
g()                             ; rewritten

Lambda-lifting is a powerful technique, but it has one major drawback: the compiler must be able to find all call-sites that require rewriting.  While considering possible implementations for lambda-lifting in VSL, I came to the conclusion (perhaps wrongly – I’d be glad to be corrected)  that even if  every call-site could be identified, it could not always be rewritten.  The general failing case seems to be when a function to be rewritten has been passed as an argument to another function;  i.e. where there are call-sites within higher-order functions:

; lazy evaluation, returns sum(0 .. +n) or sum(-n .. 0)
sum(n) = {
    lambda = {
        ?(n < 0)  -> -sum(-n)
        ?(n == 0) ->  0
        n + self(n - 1)

fold(f) = {
    f()		; f() cannot be rewritten

fold( sum(20) )
fold( sum(30) )
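The same situation rendered in Python shows why the rewrite is impossible: fold sees only an opaque function value, so there is no call-site the compiler could patch (the closed form inside the thunk is my shortcut, not part of the point).

```python
def make_sum(n):
    # a lazily evaluated sum(0..n); n is a free variable of the thunk
    def thunk():
        if n < 0:
            return -make_sum(-n)()
        return n * (n + 1) // 2
    return thunk

def fold(f):
    # f's definition is unknown here: f() cannot be rewritten
    return f()

print(fold(make_sum(20)), fold(make_sum(30)))  # 210 465
```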

Lambdas whose call-sites cannot be completely determined or rewritten at compile time are said to ‘escape’ and must be handled differently.  In the next post,  I describe a different technique for managing closures at runtime, which is the one I actually implemented in VSL.

Pointless code

Now I know that Functional Programming is not just about closures, but the FP community as a whole seems to make a pretty big deal about them, so it is disappointing to find an utterly pointless example in Wikipedia’s article on lambda-lifting.  I’ll be posting about lambda-lifting and other compiler techniques used to bind free variables into closures soon,  but for now we’ll look at what’s wrong with the Wikipedia code. To save you the trouble of having to open two browser tabs and flick between them, here’s a transcription of the algorithm into VSL.

sum(n) = {
    ?(n == 1) -> 1
    f(x) = {
        n + x
    f( sum ( n-1 ) )

This function sums the natural numbers from 1..n.  It does this by creating a closure bound to the free variable n that is declared as a formal parameter to it. This closure is then invoked recursively on decreasing values of n summing as it goes.  You have to be kidding, right?  Leaving aside a deadly bug, which we’ll come to in a moment,  the use of a closure here is utterly unnecessary, and simply obfuscates a proper understanding of the core algorithm, which is:

sum(n) = {
    ?(n == 1) -> 1
    n + sum(n-1)

Not a closure in sight, and for good reason.  A closure is never required if you do not intend to defer a function’s execution. Hope you got that, all you FP bloggers, because it would be so much better if the numerous websites expounding on the topic of closures actually contained some worthwhile examples.

Moving swiftly on, have you spotted the bug yet? If not, consider what would happen if you attempted to pass a number less than 1 to this function.  Remember your pre- and post-conditions, invariants and bound functions. But hang on, why are we using recursion at all?  Recursion is expensive, and not all platforms support tail-call optimisation to eliminate it – in particular the JVM.  The curious truth is that if you wrote this function in Scala, Java, Groovy, or indeed Clojure, it would fail for sufficiently large n with a stack overflow. Yes, I am aware of Clojure’s get-out-of-jail recur syntax in this regard, but the point surely is that these days you shouldn’t have to be thinking ‘how does the compiler/platform implement this?’,  just as you shouldn’t have to worry about memory management any more.

There’s an awful lot of chat on the internet about how recursion is somehow ‘better’ than iteration, but I have come to the conclusion that supporters of this dogma are simply wrong.  This is not to say that there aren’t some algorithms for which recursion is completely appropriate and the preferred option,  but there are equally situations when an iterative approach is more expressive and natural:

sum(n) = {
    sum = 0, num = n
    loop(num > 0) {
        sum = sum + num
        num = num - 1

Notice how this implementation does not suffer from the subtle bug in the recursive version, because it explicitly states its operating conditions?  Of course it won’t run out of stack space either, which is helpful.  I think some of the ‘recursion is better than iteration’ camp take that view because iteration always requires state to be maintained, and somehow state is bad.  Such state-hate is, in my opinion, misguided and the result of confused thinking. The iterative function above is completely thread-safe. Even though it maintains internal state, it does not modify external state and so, like its recursive counterpart, is entirely free from side-effects. Both the iterative and the recursive functions are equally ‘pure’ in this regard.

Today’s takeaway:  understand why you make the algorithmic choices you do.  Do not be beguiled by the apparent elegance of recursion into using it without ever thinking of the alternatives.  Equally, use recursion when it’s appropriate rather than blindly sticking to iteration simply because it can be easier to reason about.  But best of all dearly beloved,  realise that there’s no need to write an O(n) function when an O(1) one will do:

sum(n) = {
    (n * (n + 1)) / 2
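In Python, all three versions agree on the answer, but only one of them is constant-time (and note the guard in the iterative version: passing 0 simply yields 0 instead of recursing forever):

```python
def sum_recursive(n):
    return 1 if n == 1 else n + sum_recursive(n - 1)

def sum_iterative(n):
    total, num = 0, n
    while num > 0:            # n < 1 falls straight through to 0
        total, num = total + num, num - 1
    return total

def sum_closed(n):
    return n * (n + 1) // 2   # O(1): no stack, no loop

print(sum_recursive(100), sum_iterative(100), sum_closed(100))  # 5050 5050 5050
```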

VSL gets funky

Well, despite my avowed intentions, it’s been an age since I last posted anything. Other things have grabbed far too much of my attention recently, but the wait has not been in vain dear reader(s), for I am very pleased to report that VSL now has rudimentary support for Functional Programming. It’s been quite a journey: I’ve had to get to grips with languages like Ruby, Scala and particularly Haskell to understand and appreciate the power – and pitfalls – of the FP way. I’m still learning,  but at least I’ve learned enough that the following compiles and executes correctly:

module funky {
    double(x) = {
        x * 2
    seq = {
        yield 1
        yield 2
        yield 3
    reduce(f, g) = {
        v = 0
        with (x <- g)  {
            v = v + f(x)
    main = {
        print ('%i\n', reduce(double, seq) )
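For comparison, here is a near line-for-line Python rendering of the funky module, with a generator playing the part of VSL’s co-routine (reduce_ is my name, to avoid shadowing a builtin):

```python
def double(x):
    return x * 2

def seq():
    # a generator: each yield suspends, just like the VSL co-routine
    yield 1
    yield 2
    yield 3

def reduce_(f, g):
    v = 0
    for x in g:        # lazily pulls values out of the co-routine
        v = v + f(x)
    return v

print('%i' % reduce_(double, seq()))  # 12
```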

There’s quite a lot of new things going on in there: simple type inferencing, co-routines, lazy evaluation and higher-order functions to name just a few, and I’m pretty chuffed to say the least. I’ve also had a fair few Eureka! moments along the way, which I’ll blog about in the next few posts as I delve deeper into these ideas and their implementations in VSL, but for now I’ll just let you bask in the utter glory of this:

C:\vsl\bin> run vsl funky

In search of simplicity

A lively discussion over at The Reinvigorated Programmer has prompted some thoughts from me about the sort of language features VSL might be expected to support as it oh-so-slowly develops. But today’s post isn’t about VSL, it’s about Java, and specifically about simplicity, elegance and brevity. Here’s our challenge:

      “”.methods.sort.grep /index/i

On his blog, a reinvigorated Mike Taylor has been extolling the virtues of Ruby, and in particular the above code snippet which prints every method of the String class that has “index” somewhere in its name. This pared-down Ruby code is contrasted with some pretty average Java, written not by Mike I hasten to add, but by Steve Yegge, and taken from Steve’s article on Allocation Styles:

    List results = new ArrayList();
    Method[] methods = String.class.getMethods();
    for (int i = 0; i < methods.length; i++) {
        Method m = methods[i];
        if (m.getName().toLowerCase().indexOf("index") != -1) {
            results.add(m.getName());
        }
    }
    String[] names = (String[]) results.toArray();
    return names;

It’s not my intention to review this code here, but I would remark in passing that you shouldn’t believe everything you read in the papers, even from supposed experts. My real intention today is to see how we might produce a Java version of the above functionality that approaches the simplicity and brevity of the Ruby code.

we’re going to need a plumber
The first point to note is that Ruby really has done all the heavy lifting for you. For example, sort can be called directly on a Ruby list, array or set, whereas in Java you have to call the static Collections.sort or Arrays.sort methods and pass in whatever it is you want to be sorted. And the icing on the cake for the Rubyist is the native implementation of grep on lists. Java has no such capability. Clearly a Java solution is going to require a bit of remedial plumbing. The second point to note is that there is often more than one way to skin a cat. The code above creates a list and then sorts it but a simpler (and bug-free) solution would simply be to use an instance of TreeSet. No sorting required (and no duplicate method names either).

With those points in mind, and starting with the tip of the iceberg, here’s my Java version. This dumps the full method signature of all the methods of the String class that end in “of”. There’s a whole menagerie of them, so for the sake of brevity I haven’t bothered to print them here.

    Filter f = new Filter(String.class.getMethods());
    for (Iterator i = f.grep("\\.[a-z]+of");  i.hasNext(); ) {
        System.out.println(;
    }

I think you’ll agree that the Java code is reasonably elegant and simple.

ooh – i just used a closure!
The key to this solution lies in two small utility classes. This is the plumbing we have to write in order to compete with Ruby, but we only have to write it once. The first of these is the Filter class. This has two constructors, one that takes an Array as a parameter, and one that takes a Collection. (This bifurcation of container types in Java is one of the clumsiest features of the language. At least by providing two different constructors, we relieve the programmer of the burden of managing their differences). The Filter class has one public method, grep, which returns an Iterator. That’s (almost) it. Here’s the code:

package org.mambofish.filter;

import java.util.Collection;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Filter<T> {

    private Set<String> mySet = new TreeSet<String>();

    public Filter(Iterable<T> iterable) {
        for (Object o : iterable)
            mySet.add(o.toString());
    }

    public Filter(Object[] array) {
        for (Object o : array)
            mySet.add(o.toString());
    }

    public Iterator<String> grep(String regularExpression) {
        final Pattern pattern = Pattern.compile(regularExpression,
                                       Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher("");
        return new FilteringIterator(mySet.iterator()) {
            protected boolean accept(Object o) {
                matcher.reset((String) o);
                return matcher.find();
            }
        };
    }
}

The grep method takes a regular expression and returns an Iterator over only those objects that match. The iterator itself is implemented as a closure – an anonymous instance of a FilteringIterator that gets created whenever grep is called, and which has access to the matcher object in its enclosing scope (well, it appears to, but it actually cheats, which is why it’s not a real Lispy closure). Anyway, this instance provides a concrete implementation of FilteringIterator’s abstract method accept, thereby allowing us to select only those objects that match our regular expression.

FilteringIterator itself is pretty straightforward: it simply implements the Iterator interface, delegating most of the hard work to the actual iterator of the source set:

package org.mambofish.filter;
import java.util.Iterator;
import java.util.NoSuchElementException;

public abstract class FilteringIterator implements Iterator {
    private Iterator source;
    private Object next;

    public FilteringIterator(Iterator s) {
        source = s;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    public boolean hasNext() {
        while ((next == null) && (source.hasNext())) {
            Object o =;
            if (accept(o)) {
                next = o;
                return true;
            }
        }
        return (next != null);
    }

    protected abstract boolean accept(Object o);

    public Object next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Object obj = next;
        next = null;
        return obj;
    }
}

in conclusion
I’m not going to deny that Java is a verbose language. The excess verbiage is part of the price you pay for all that type safety. There are many aspects of Java’s syntax that leave a lot to be desired, and as I have said, Java’s implementation of containers is deeply flawed. It is questionable whether 1.5 – and specifically generics – has improved the language or moved it a bit further down the road towards early retirement. It’s certainly showing its age, having been conceived in 1995. But despite all that, it is remarkably flexible, and easy to learn (I am not talking here about the numerous frameworks that have Sprung up around it). But, as I hope I have demonstrated, it is occasionally possible to write simple, concise, elegant Java code. Perhaps even more than occasionally.

Carrying out the trash

At some point I’m going to have to get VSL to clean up old unused objects, because I don’t want to have to leave that responsibility up to programmers (ok, that’s just me right now, but hey). There are two main paradigms currently in use: tracing collection and reference counting. Tracing collection releases unused objects after checking that they cannot be reached from any other object in memory. Reference counting releases objects when a special value on the object itself – its reference count – reaches zero. Both schemes have their drawbacks and advantages. For example, reference counting techniques typically find it difficult to process object cycles, whilst tracing implementations usually require the introduction of finalizers into the language, with all their attendant problems of non-deterministic runtime behaviour. Tracing can suffer from the “stop the world” problem, where the running program is effectively halted while unused objects are released back to the heap, while reference counting can cause CPU pipeline thrashing and adversely affect the performance of the program as a result.

The more I read about this, the more I realise that there is no perfect solution. There are literally thousands of computer science papers on the subject. However, some fairly recent ideas about new reference counting schemes have piqued my interest; and in any case a reference counting scheme is certainly much simpler to implement than a full-on tracing garbage collector.

Additionally, I don’t need to be too perfectionist. If I am prepared to put up with the possibility that not all unused objects may get released I am hopeful that I can write a reasonably efficient reference-counting garbage collector. Thankfully there are many who have trod this path before me, so I am not short of sample code to look at. And as I mentioned earlier, there is a wealth of academic papers to refer to. If you’re interested in this sort of thing, a good place to start (it was for me) might be here
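As a taste of what’s involved, here is a toy reference-counting scheme in Python (entirely my own sketch: in a real collector the retain/release calls are hooked into assignment and scope exit by the compiler, and cycles, as noted above, would leak):

```python
class Ref:
    freed = []
    def __init__(self, name):
        self.name, self.count = name, 1   # creation takes the first reference
    def retain(self):
        self.count += 1
    def release(self):
        self.count -= 1
        if self.count == 0:
            Ref.freed.append(self.name)   # last reference gone: reclaim

obj = Ref("thunk")
obj.retain()      # a second reference is taken
obj.release()     # one reference dropped: the object survives
obj.release()     # final reference dropped: the object is freed
print(Ref.freed)  # ['thunk']
```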

Invoking Interfaces – Part II

As promised, here is the second and final part on invoking interface methods. As I mentioned in the first post, the fundamental problem is to discover which class method to call at runtime. The solution I show below is not the only way to do it, but it is simple and very quick, which are among the stated design goals for the language. On the downside, the penalty we pay for this performance comes in the form of semantic limitations on the language itself: we restrict the number of unique interfaces allowed in any one executable to 255, and the maximum number of interfaces that any single class may implement to 4.

the class i-table
When the compiler parses a class definition, it maintains a list of up to four 8-bit interface ids that are implemented by the class. This data is written into the executable’s global data section and is the start of what is known as the class i-table.

Immediately following the 32 bits of the i-table header are four additional 32-bit entries, corresponding to the 4 bytes of the header. This is the i-table dispatch structure. Each entry in the dispatch structure points to an area of memory which holds, in order, the addresses of the implementations of that interface’s methods. An example should help clarify:

an example
Assume we have two interfaces, I1 and I2 and a class C that implements them both. I1 has 3 methods (a, b, c), and I2 has 4 (d, e, f, g). Let us also suppose that these interfaces have been allocated interface ids of 47 and 93 respectively. The assembly output for class C’s i-table looks like this:

__C$ITable:  DB 47, 93, 0, 0
             DWORD __C$IF47
             DWORD __C$IF93
             DWORD 0
             DWORD 0
__C$IF47:    DWORD C_a
             DWORD C_b
             DWORD C_c
__C$IF93:    DWORD C_d
             DWORD C_e
             DWORD C_f
             DWORD C_g

linking an instance of a class to the class i-table
When a new object is created, the first 4 bytes of the object in memory are set to point to its class i-table. In this way, every instance of a class carries around with it the information relating to which interfaces the class supports. Here is the compiler output for a call to C’s constructor, showing how this is done.

	invoke	HeapAlloc, hProcessHeap, 8, 16
	mov	[eax], OFFSET __C$ITable	; i-table ptr
	mov	ebx, [ebp + 8]
	mov	[ebx + 4], eax   ; assign memory to object ref
	push	eax
	call	C_new

putting it all together
The last thing to be done is to actually call the implementing class method when an interface method is invoked.  First, any arguments required by the method are pushed onto the stack in the normal way. Next, the compiler places the current object’s <this> reference in register ebx (where it will usually be anyway), and places the method signature (interface_id * 256 + method_id) into register eax. It then calls the runtime support procedure __invokeInterfaceMethod:

    mov    eax, 93*256 + 2
    mov    ebx, [this]
    call    __invokeInterfaceMethod

The __invokeInterfaceMethod obtains the address of the 32-bit i-table header, a pointer to the first entry in the i-table dispatch structure, and decomposes the method-signature into the corresponding method-id and interface-id :

__invokeInterfaceMethod PROC
    mov        esi, [ebx]               ; esi -> class i-table header
    mov        edx, eax
    lea        ecx, [esi + 4]           ; ecx -> first dispatch entry
    shr        eax, 8                   ; interface id
    and        edx, 0ffh                ; method id

Next, it finds the entry in the dispatch structure that points to the specified interface’s first method, by checking each byte in the i-table header, looking for a match to the interface id. If it fails to locate it, something is seriously wrong…

    cmp    BYTE PTR [esi], al
    je     @F
    inc    esi
    add    ecx, 4
    cmp    BYTE PTR [esi], al
    je     @F
    inc    esi
    add    ecx, 4
    cmp    BYTE PTR [esi], al
    je     @F
    inc    esi
    add    ecx, 4
    cmp    BYTE PTR [esi], al
    je     @F

    print "Fatal Error. A required interface is not implemented. Program will exit", 10, 13

Finally, it calculates the offset to the interface method’s implementation pointer, and jumps to the address it points to.

@@: mov    ecx, DWORD PTR [ecx]         ; ecx -> the interface's method table
    lea    ecx, [ecx + edx * 4]         ; ecx -> method implementation pointer
    jmp    DWORD PTR [ecx]

__invokeInterfaceMethod ENDP

And that’s all there is to it! Note that we use a jump, not a call, to reach our method, so its return will take us back to the instruction directly after the call to __invokeInterfaceMethod, which is exactly where we want to be. This is safe to do because we have not modified the stack.
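For the assembler-averse, the whole dispatch path is easy to model in a few lines of Python (a sketch of my own, with a dict standing in for the fixed-size header and pointer tables):

```python
class ITable:
    def __init__(self, interfaces):
        # interfaces: {interface_id: [method, ...]}; at most 4 per class
        assert len(interfaces) <= 4
        self.header = list(interfaces) + [0] * (4 - len(interfaces))
        self.dispatch = [interfaces.get(i) for i in self.header]

    def invoke(self, signature, this, *args):
        # signature = interface_id * 256 + method_id, as in the asm
        interface_id, method_id = divmod(signature, 256)
        for slot, iid in enumerate(self.header):   # linear scan, max 4 probes
            if iid == interface_id:
                return self.dispatch[slot][method_id](this, *args)
        raise RuntimeError("A required interface is not implemented")

# interfaces 47 (a, b, c) and 93 (d, e, f, g) from the example above
itable = ITable({
    47: [lambda self: "a", lambda self: "b", lambda self: "c"],
    93: [lambda self: "d", lambda self: "e", lambda self: "f",
         lambda self: "g"],
})
print(itable.invoke(93 * 256 + 2, None))  # f
```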