Diving into a CAD library with call tree analysis, Part 2

Previously, I showed how you can study which functions call what in a library by using gdb catchpoints to observe a running process’s call stack every time the application makes a syscall. That was useful for understanding the parts of the program related to external libraries, graphics, multi-threading, and compiler optimization, but it didn’t provide any insight into the mathematical and algorithmic core of the library, which generally doesn’t make any syscalls. Here I’ll show how you can do that with the tool callgrind.

Callgrind is a utility that comes with the Valgrind software instrumentation framework. It records not only the function call graph, but also how often each function is called and what portion of the program’s run time is spent in it. This is very similar to the analysis I did with gdb, but thanks to Valgrind’s magic it’s able to capture the mathematical internals. On the other hand, it doesn’t capture much of the system-level goings-on, and it doesn’t yield any insight into what arguments functions are called with, which the previous method did.

To use it, I executed the DRAWEXE executable like so.

. env.sh d
valgrind --tool=callgrind DRAWEXE

I then ran the same experiment as before: having it render a drill bit, moving the camera a little, and then closing the program. This was considerably slower than with gdb. The render took something like 30 minutes, and at one point I almost stopped the experiment thinking it was hanging, until something was printed to the console showing that it was just very slow.

That produced a file callgrind.out.11216, where 11216 was the process ID of DRAWEXE. It was 12 MB of data, less than what the gdb test generated despite providing more information per function, probably because of the high redundancy in the gdb method’s archive. To parse the data, there’s a helpful utility, gprof2dot, that understands output from many profiling tools, including callgrind.

./gprof2dot.py \
    /path/to/callgrind.out.11216 \
    -f callgrind \
    -o callgrind.out.11216.dot
    # more options:
    # --color-nodes-by-selftime colors functions based on how long they alone took
    # --node-thres=0.1 reduces the node filtering to paths that took >0.1% of runtime

It outputs them as a DOT file which can be viewed using a variety of tools. I used XDot.

Figure 1: Every function is represented as a block, with caller functions drawn above and connected to their callees. The block colors are warmer hues when more time is spent in that function or its children. Click on the image to see the fine details. You’ll need to scroll a bit to see the actual image contents.

The hottest path in the figure, running from main at the top in red, through the orange Draw_Interpretor::CallBackDataFunc::Invoke, and down to buildsweep in green, appears to be related to creating the flutes of the drill bit. Essentially everything here is related to geometric computation.

Another insight that can be gleaned from staring at this image for a long time is that polynomial evaluation takes up quite a large portion of the runtime. I looked into the source, and it uses Horner’s method to get the value of a polynomial, which turns \(ax^3 + bx^2 + cx + d\) into \(d + x(c + x(b + xa))\), requiring fewer multiplications, especially for high-degree polynomials. But as I understand it, De Boor’s algorithm is somehow more performant for splines. I wonder if that’s actually true, or whether there’s a sub-optimal implementation here.

This analysis omits any functions on paths that took up less than 0.5% of the program’s runtime, but that can be changed with gprof2dot’s --node-thres option. I’ll provide the callgrind output if you’re interested in playing with this data.

I think that between callgrind and gdb, there’s plenty of tooling to learn how this library works. If I were to keep pulling on this call tree thread, I might look into static analysis tools that extract call trees by studying the source code. I understand this command works for C++ files.

clang -S -emit-llvm main.cpp -o - | opt -analyze -dot-callgraph

However, OCCT has a very complex build system which would make doing this very complicated. Not to mention that they don’t seem to use LLVM-based compilers, so it’s not even guaranteed to work. But perhaps I’ll explore this anyway at a later date.

Thought that was interesting? Let me know by emailing me.