I ran into a case where a small script was taking much longer on a particular operating system than others. Using Python's built-in cProfile module, I was easily able to create a table of the functions that were taking the most time:

>python -O -mcProfile -scumtime load-all.py
         330729 function calls (324916 primitive calls) in 0.680 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    236/1    0.020    0.000    0.681    0.681 {built-in method builtins.exec}
        1    0.000    0.000    0.681    0.681 load-all.py:1(<module>)
    278/3    0.002    0.000    0.457    0.152 <frozen importlib._bootstrap>:966(_find_and_load)
    278/3    0.001    0.000    0.457    0.152 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
    238/3    0.001    0.000    0.456    0.152 <frozen importlib._bootstrap>:651(_load_unlocked)
    ...

However, the relative times spent in various functions were not the same across the platforms I was testing. When the distribution of cumulative time in functions is different, the comparison becomes complicated and a bit of reorganization helps to identify problem areas.

Gathering Data: Profiling the Script on Multiple Platforms

The over-all goal of this work was to improve the beginning to end run time of a script. Therefore, import time and all set-up and tear-down work was important and to be measured. Luckily, this made it extremely easy to profile right from the command line with a simple:

> python -O -mcProfile -oload-data.linux.prof load-data.py

So the output the cProfile run on the script load-data.py was output to a file called load-data.linux.prof. Great! I ran a similar command on Windows to get a file I called load-data.win.prof. Noticed I used the -O option when running this to avoid any if __debug__ or assert statements since these are really only useful for debugging and should not be relied upon in production code. In fact, I'm going to go ahead and claim that normal operation of any Python script should be done with the -O option, dropping it only when debugging, altering or writing code.

One thing we'll have to do is to strip the leading directories on all function identifiers on the system where the profiling was done. This is because the pstats.Stats.strip_dirs() method does not currently handle windows paths when analyzed on a Linux machine. This can be done using the interpreter fairly easily:

>>> import pstats
>>> s = pstats.Stats('load-data.win.prof')
>>> s.strip_dirs().dump_stats('load-data.win.prof')

and the profiling data should be overwritten without the leading paths in the filenames.

Initial Comparison

The first pass in comparing the timings measured by cProfile is easy: just list them side by side. Using the pstats Python module, I used this script:

import os, pstats, sys

class ProfileResults(pstats.Stats):
    def __init__(self, filename):
        super().__init__(filename)
        self.name = os.path.splitext(os.path.basename(filename))[0]
        self = self.strip_dirs()

results = [ProfileResults(f) for f in sys.argv[1:]]
results[0].sort_stats('cumtime').print_stats(20)

to print the timings sorted by cumulative time spent in each function. If this script seems overly complicated, don't worry about it quite yet - recall we are setting up for the comparison between two profiling results. Anyways, here are my results for Linux:

load-all.linux.prof

         527040 function calls (521169 primitive calls) in 0.812 seconds

   Ordered by: cumulative time
   List reduced from 1954 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    236/1    0.018    0.000    0.815    0.815 {built-in method builtins.exec}
        1    0.000    0.000    0.815    0.815 load-all.py:1(<module>)
    278/3    0.002    0.000    0.427    0.142 <frozen importlib._bootstrap>:966(_find_and_load)
    278/3    0.001    0.000    0.427    0.142 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
    238/3    0.001    0.000    0.426    0.142 <frozen importlib._bootstrap>:651(_load_unlocked)
    193/3    0.001    0.000    0.426    0.142 <frozen importlib._bootstrap_external>:672(exec_module)
    339/3    0.000    0.000    0.425    0.142 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
   896/40    0.001    0.000    0.418    0.010 <frozen importlib._bootstrap>:997(_handle_fromlist)
     96/7    0.000    0.000    0.418    0.060 {built-in method builtins.__import__}
    50/20    0.000    0.000    0.371    0.019 contextlib.py:49(inner)
       10    0.000    0.000    0.292    0.029 load.py:755(load_tecplot)
       10    0.000    0.000    0.275    0.028 load.py:21(_tecplot_loader_load_data)
        1    0.000    0.000    0.257    0.257 session.py:207(connect)
        1    0.000    0.000    0.257    0.257 tecutil_connector.py:124(connect)
        1    0.000    0.000    0.240    0.240 tecutil_client.py:2(<module>)
       17    0.001    0.000    0.228    0.013 __init__.py:1(<module>)
      150    0.002    0.000    0.221    0.001 preamble.py:20(_fn)
      150    0.204    0.001    0.216    0.001 tecutil_client.py:362(sndrcv)
        1    0.000    0.000    0.210    0.210 message_pb2.py:4(<module>)
        1    0.000    0.000    0.194    0.194 __init__.py:33(<module>)

and for Windows:

load-all.win.prof

         290648 function calls (283683 primitive calls) in 1.947 seconds

   Ordered by: cumulative time
   List reduced from 2044 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    245/1    0.003    0.000    1.948    1.948 {built-in method builtins.exec}
        1    0.000    0.000    1.948    1.948 load-all.py:1(<module>)
    50/20    0.000    0.000    1.332    0.067 contextlib.py:71(inner)
      130    0.001    0.000    1.330    0.010 preamble.py:20(_fn)
      130    1.295    0.010    1.328    0.010 tecutil_client.py:362(sndrcv)
       10    0.000    0.000    1.187    0.119 load.py:755(load_tecplot)
       10    0.000    0.000    0.927    0.093 load.py:21(_tecplot_loader_load_data)
    306/3    0.003    0.000    0.591    0.197 <frozen importlib._bootstrap>:978(_find_and_load)
    306/3    0.001    0.000    0.591    0.197 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
    244/3    0.001    0.000    0.587    0.196 <frozen importlib._bootstrap>:663(_load_unlocked)
    202/3    0.001    0.000    0.587    0.196 <frozen importlib._bootstrap_external>:722(exec_module)
    380/3    0.000    0.000    0.586    0.195 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
   832/45    0.001    0.000    0.573    0.013 <frozen importlib._bootstrap>:1009(_handle_fromlist)
    117/7    0.001    0.000    0.572    0.082 {built-in method builtins.__import__}
        1    0.000    0.000    0.378    0.378 session.py:207(connect)
        1    0.000    0.000    0.378    0.378 tecutil_connector.py:124(connect)
        1    0.000    0.000    0.355    0.355 tecutil_client.py:2(<module>)
       10    0.000    0.000    0.345    0.034 layout.py:332(active_frame)
       16    0.000    0.000    0.339    0.021 __init__.py:1(<module>)
        1    0.000    0.000    0.290    0.290 message_pb2.py:4(<module>)

This script calls into native libraries quite a bit and so the tottime is mostly time spent in compiled C++ code. If we can identify this as the culprit, then we could fire up valgrind or similar to start profiling that. But since we are really interested in the beginning-to-end run time of the script, we sort first on cumulative time as shown above.

At this point, we can take this information and start working on optimizing the longest running methods and work our way down. But let's say we've done quite a bit of this already and we now want to understand the differences when running on these two platforms. One reason for wanting to do this would be if a script runs many times slower on a specific system and we want to understand why.

Reorganization for a Better Understanding

The first thing to notice is that the script takes about twice as long on Windows as it does on Linux. The computer hardware was similar but not exactly so and we should try to normalize the times across the platforms tested. The only normalization factor that makes sense at this point in the analysis is the total time as shown in the results above: 0.812 s on Linux and 1.947 s on Windows. Adding a normalization method to our script effectively transforms all times to fractions of total execution time:

import os, pstats, sys

class ProfileResults(pstats.Stats):
    def __init__(self, filename):
        super().__init__(filename)
        self.name = os.path.splitext(os.path.basename(filename))[0]
        self = self.strip_dirs()
        self.normalize_times()

    def normalize_times(self):
        total_tt = self.total_tt
        oldstats = self.stats
        self.stats = {}
        for func, (cc, nc, tt, ct, oldcallers) in oldstats.items():
            tt /= total_tt
            ct /= total_tt
            callers = {}
            for func2, (cc2, nc2, tt2, ct2) in oldcallers.items():
                tt2 /= total_tt
                ct2 /= total_tt
                callers[func2] = (cc2, nc2, tt2, ct2)
            self.stats[func] = (cc, nc, tt, ct, callers)

results = [ProfileResults(f) for f in sys.argv[1:]]
results[0].sort_stats('cumtime').print_stats(20)

And Now our results look like this for Linux:

load-all.linux.prof

         527040 function calls (521169 primitive calls) in 0.812 seconds

   Ordered by: cumulative time
   List reduced from 1954 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    236/1    0.022    0.000    1.003    1.003 {built-in method builtins.exec}
        1    0.000    0.000    1.003    1.003 load-all.py:1(<module>)
    278/3    0.002    0.000    0.525    0.175 <frozen importlib._bootstrap>:966(_find_and_load)
    278/3    0.002    0.000    0.525    0.175 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
    238/3    0.002    0.000    0.524    0.175 <frozen importlib._bootstrap>:651(_load_unlocked)
    193/3    0.001    0.000    0.524    0.175 <frozen importlib._bootstrap_external>:672(exec_module)
    339/3    0.000    0.000    0.524    0.175 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
   896/40    0.001    0.000    0.515    0.013 <frozen importlib._bootstrap>:997(_handle_fromlist)
     96/7    0.000    0.000    0.515    0.074 {built-in method builtins.__import__}
    50/20    0.001    0.000    0.457    0.023 contextlib.py:49(inner)
       10    0.000    0.000    0.359    0.036 load.py:755(load_tecplot)
       10    0.001    0.000    0.339    0.034 load.py:21(_tecplot_loader_load_data)
        1    0.000    0.000    0.316    0.316 session.py:207(connect)
        1    0.000    0.000    0.316    0.316 tecutil_connector.py:124(connect)
        1    0.000    0.000    0.296    0.296 tecutil_client.py:2(<module>)
       17    0.001    0.000    0.281    0.017 __init__.py:1(<module>)
      150    0.002    0.000    0.272    0.002 preamble.py:20(_fn)
      150    0.251    0.002    0.266    0.002 tecutil_client.py:362(sndrcv)
        1    0.000    0.000    0.259    0.259 message_pb2.py:4(<module>)
        1    0.000    0.000    0.239    0.239 __init__.py:33(<module>)

and for Windows:

load-all.win.prof

         290648 function calls (283683 primitive calls) in 1.947 seconds

   Ordered by: cumulative time
   List reduced from 2044 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    245/1    0.002    0.000    1.000    1.000 {built-in method builtins.exec}
        1    0.000    0.000    1.000    1.000 load-all.py:1(<module>)
    50/20    0.000    0.000    0.684    0.034 contextlib.py:71(inner)
      130    0.000    0.000    0.683    0.005 preamble.py:20(_fn)
      130    0.665    0.005    0.682    0.005 tecutil_client.py:362(sndrcv)
       10    0.000    0.000    0.610    0.061 load.py:755(load_tecplot)
       10    0.000    0.000    0.476    0.048 load.py:21(_tecplot_loader_load_data)
    306/3    0.001    0.000    0.303    0.101 <frozen importlib._bootstrap>:978(_find_and_load)
    306/3    0.001    0.000    0.303    0.101 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
    244/3    0.001    0.000    0.302    0.101 <frozen importlib._bootstrap>:663(_load_unlocked)
    202/3    0.001    0.000    0.302    0.101 <frozen importlib._bootstrap_external>:722(exec_module)
    380/3    0.000    0.000    0.301    0.100 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
   832/45    0.000    0.000    0.294    0.007 <frozen importlib._bootstrap>:1009(_handle_fromlist)
    117/7    0.000    0.000    0.294    0.042 {built-in method builtins.__import__}
        1    0.000    0.000    0.194    0.194 session.py:207(connect)
        1    0.000    0.000    0.194    0.194 tecutil_connector.py:124(connect)
        1    0.000    0.000    0.182    0.182 tecutil_client.py:2(<module>)
       10    0.000    0.000    0.177    0.018 layout.py:332(active_frame)
       16    0.000    0.000    0.174    0.011 __init__.py:1(<module>)
        1    0.000    0.000    0.149    0.149 message_pb2.py:4(<module>)

From here, we can finally start to compare directly the results of the profiling and first off, we notice that importing Python modules seems to be taking more time on Linux -- recall that from now on, when we say "time spent," it is always in fractions of total execution time. Read another way, we can say that the operations in and around the tecutil_client.py:362(sndrcv) method are taking longer on Windows than on Linux.

After some thought, I realized what I wanted to know was the relative difference in times for the various functions. So, focusing on cumulative time, we now print out the times for each platform along with the difference between them, sorting on the absolute magnitude of this difference. Here is the new comparison script:

import math, os, pstats, sys

class ProfileResults(pstats.Stats):
    def __init__(self, filename):
        super().__init__(filename)
        self.name = os.path.splitext(os.path.basename(filename))[0]
        self = self.strip_dirs()
        self.normalize_times()

    def normalize_times(self):
        total_tt = self.total_tt
        oldstats = self.stats
        self.stats = {}
        for func, (cc, nc, tt, ct, oldcallers) in oldstats.items():
            tt /= total_tt
            ct /= total_tt
            callers = {}
            for func2, (cc2, nc2, tt2, ct2) in oldcallers.items():
                tt2 /= total_tt
                ct2 /= total_tt
                callers[func2] = (cc2, nc2, tt2, ct2)
            self.stats[func] = (cc, nc, tt, ct, callers)

    def compare_cumtime(self, other):
        res = []
        for func, (_, _, _, ct0, _) in self.stats.items():
            try:
                ct1 = other.stats[func][3]
            except KeyError:
                ct1 = math.nan
            res.append(('{}:{}({})'.format(*func), ct0, ct1))
        return res

results = [ProfileResults(f) for f in sys.argv[1:]]
ctcomp = results[0].compare_cumtime(results[1])
ctcomp = sorted(ctcomp, key=lambda x: -abs(x[2] - x[1]))

print('''cumulative time:
 {}  {}   diff   function'''.format(*[r.name for r in results]))
for func, ct0, ct1 in ctcomp[:20]:
    print(' {0: ^{1}.3f}  {2: ^{3}.3f}  {4: >+6.3f}  {5}'.format(
        ct0, len(results[0].name), ct1, len(results[1].name),
        ct1 - ct0, func))

and here are the results:

cumulative time:
 load-all.linux  load-all.win   diff   function
     0.012          0.103      +0.091  tecutil_rpc.py:1637(DataSetIsAvailableForFrame)
     0.010          0.088      +0.078  tecutil_rpc.py:2431(FrameGetActiveID)
     0.012          0.088      +0.077  tecutil_rpc.py:3898(PageGetUniqueID)
     0.005          0.044      +0.040  tecutil_rpc.py:2443(FrameGetDataSetUniqueIDByFrameID)
     0.005          0.045      +0.040  tecutil_rpc.py:5142(StringListAlloc)
     0.005          0.044      +0.039  tecutil_rpc.py:2439(FrameGetCount)
     0.006          0.044      +0.038  tecutil_rpc.py:5167(StringListDealloc)
     0.006          0.044      +0.038  tecutil_rpc.py:5152(StringListAppendString)
     0.096          0.074      -0.022  tecutil_rpc.py:3845(NewLayout)
     0.001          0.000      -0.001  tecutil_rpc.py:2(<module>)
     0.000          0.000      -0.000  tecutil_rpc.py:21(TecUtilRPC)
     0.006           nan         +nan  tecutil_rpc.py:5177(StringListGetCount)
     0.005           nan         +nan  tecutil_rpc.py:5182(StringListGetRawStringPtr)
     0.030          0.009      -0.021  pyparsing.py:2738(__init__)
     0.033          0.014      -0.019  pyparsing.py:61(<module>)
     0.014          0.006      -0.009  pyparsing.py:5353(pyparsing_common)
     0.000          0.006      +0.006  reflection.py:46(<module>)
     0.103          0.107      +0.005  tecutil_rpc.py:1685(DataSetReadX)
     0.007          0.003      -0.004  pyparsing.py:4210(__init__)
     0.007          0.002      -0.004  pyparsing.py:3702(leaveWhitespace)

Reading from top, down we see that, on Windows, the function DataSetIsAvailableForFrame is taking up almost an order of magnitude more time than on Linux along with several StringList related functions. Also, it looks like StringListGetCount is not called at all on Windows - this may be OK, but it would be best to understand why it's only called on Linux. It may mean that the code makes some assumption that is good on one platform but not on the other, and the point is not these specific functions but rather that we now have a direction to go in the investigation of script performance.

Generalization of Profile Comparisons and Conclusion

It may not always be possible to test the exact code against multiple platforms, especially if you do not have direct access to the platform you are debugging. However, if you can run similar code and you still want to compare the profiling results, you can make one more change to the script above to ensure a functions match between datasets, and that is: remove the line number in the function key. To do this, I set the line numbers to zero and then call pstats.Stats.strip_dirs() one more time. The final script looks like this:

import math, os, pstats, sys

class ProfileResults(pstats.Stats):
    def __init__(self, filename):
        super().__init__(filename)
        self.name = os.path.splitext(os.path.basename(filename))[0]
        self = self.strip_dirs()
        self.normalize_times()
        self = self.strip_dirs()

    def normalize_times(self):
        total_tt = self.total_tt
        oldstats = self.stats
        self.stats = {}
        for func, (cc, nc, tt, ct, oldcallers) in oldstats.items():
            func = (func[0], 0, func[2])
            tt /= total_tt
            ct /= total_tt
            callers = {}
            for func2, (cc2, nc2, tt2, ct2) in oldcallers.items():
                func2 = (func2[0], 0, func2[2])
                tt2 /= total_tt
                ct2 /= total_tt
                callers[func2] = (cc2, nc2, tt2, ct2)
            self.stats[func] = (cc, nc, tt, ct, callers)

    def compare_cumtime(self, other):
        res = []
        for func, (_, _, _, ct0, _) in self.stats.items():
            try:
                ct1 = other.stats[func][3]
            except KeyError:
                ct1 = math.nan
            res.append(('{0}({2})'.format(*func), ct0, ct1))
        return res

results = [ProfileResults(f) for f in sys.argv[1:]]
ctcomp = results[0].compare_cumtime(results[1])
ctcomp = sorted(ctcomp, key=lambda x: -abs(x[2] - x[1]))

print('''cumulative time:
 {}  {}   diff   function'''.format(*[r.name for r in results]))
for func, ct0, ct1 in ctcomp[:20]:
    print(' {0: ^{1}.3f}  {2: ^{3}.3f}  {4: >+6.3f}  {5}'.format(
        ct0, len(results[0].name), ct1, len(results[1].name),
        ct1 - ct0, func))

and the results from the data above now looks like this:

cumulative time:
 load-all.linux  load-all.win   diff   function
     0.012          0.103      +0.091  tecutil_rpc.py(DataSetIsAvailableForFrame)
     0.010          0.088      +0.078  tecutil_rpc.py(FrameGetActiveID)
     0.012          0.088      +0.077  tecutil_rpc.py(PageGetUniqueID)
     0.005          0.044      +0.040  tecutil_rpc.py(FrameGetDataSetUniqueIDByFrameID)
     0.005          0.045      +0.040  tecutil_rpc.py(StringListAlloc)
     0.005          0.044      +0.039  tecutil_rpc.py(FrameGetCount)
     0.006          0.044      +0.038  tecutil_rpc.py(StringListDealloc)
     0.006          0.044      +0.038  tecutil_rpc.py(StringListAppendString)
     0.096          0.074      -0.022  tecutil_rpc.py(NewLayout)
     0.001          0.000      -0.001  tecutil_rpc.py(<module>)
     0.000          0.000      -0.000  tecutil_rpc.py(TecUtilRPC)
     0.006           nan         +nan  tecutil_rpc.py(StringListGetCount)
     0.005           nan         +nan  tecutil_rpc.py(StringListGetRawStringPtr)
     0.062          0.043      -0.019  requirements.py(<module>)
     0.033          0.014      -0.019  pyparsing.py(<module>)
     0.018          0.001      -0.017  version.py(__init__)
     0.018          0.002      -0.016  version.py(parse)
     0.022          0.006      -0.016  specifiers.py(<module>)
     0.019          0.005      -0.014  specifiers.py(Specifier)
     0.014          0.006      -0.009  pyparsing.py(pyparsing_common)

So while the first actions taken to address performance of any code is to go after the slowest or longest-running sections, it may also help to evaluate and compare what these are across several of the target platforms.