How to use MAQAO to improve application performance

Author: Frédéric Pinel (LuxProvide)

MAQAO introduction

MAQAO is an open-source project, from its home page:

MAQAO (Modular Assembly Quality Analyzer and Optimizer) is a performance analysis and optimization framework operating at binary level with a focus on core performance. Its main goal is to guide application developers along the optimization process through synthetic reports and hints.
MAQAO mixes both dynamic and static analyses based on its ability to reconstruct high level structures such as functions and loops from an application binary. Since MAQAO operates at binary level, it is agnostic with regard to the language used in the source code and does not require recompiling the application to perform analyses. MAQAO has also been designed to concurrently support multiple architectures. Currently the Intel64 and Xeon Phi architectures are implemented.

While MAQAO is well suited to the analysis of assembly and machine code in the context of compiler development, it can be useful to analyze loops and functions for the improvement of an application’s performance.

It comes with a top-level interface, oneview, which calls its various underlying tools, and greatly simplifies the analysis.

oneview is also well suited to an HPC environment because the results are saved to text files (or HTML), which can be retrieved for inspection on a graphical user environment such as a web browser.

In this HowTo we’ll be using oneview, and its text report format.

HPC usage

Our running example is fast enough such that this How-To guide can be replicated on a personal computer. To follow the How-To guide on an HPC, for example EuroHPC’s Meluxina, the following Lmod modules are required.

$ module load GCC MAQAO

MAQAO run 1

For our first analysis with MAQAO, we’ll compile the program without proper attention to compilation options:
$ gcc -Og -g -std=gnu11 search.c -o search-Og.out

On a larger input le of 675 391 words (the sum of Thoreau’s works mentioned before, plus Jane Eyre and Moby Dick), the results of the execution of the sequential search are reported below (still 5 iterations).

$ ./search-Og.out ALL.txt 5
read 3472KiB (4096KiB allocated) from ALL.txt
ALL.txt parsed, contains 675391 words (capacity 1777798 words)
0 THE 39331

7 ON 3673
8 DUTY 72
9 CIVIL 27
counted 675391 words, score=2828 attempts
naive duration_sec 9.10
0 EBOOKS 28

7 OUR 1157
8 SUBSCRIBE 4
9 HOW 971
counted 675391 words, score=1845 attempts
heur duration_sec 8.98

We see that for this input size the move to top outperforms the naive sequential search, in both runtime and score (average position in the word count list).
The initial performance is 9.10 seconds, we’ll use this as our reference.

 

MAQAO recommendations

Now we are ready to use MAQAO. The command line below lists the typical parameters to oneview.

$ maqao oneview \

> –create-report=one \

> -xp=report-Og \

> –output-format=text \

> -executable=./search-Og.out \

> –run-command=” ALL.txt 5″ \

> –with-POP \

> –lprof-params=”–sampling-rate=high –stdout-start-keywords=parsed” \

> -dbg=2 \

> | tee search-Og.log

  • create-report=one is required.
  • xp is the directory for the report les.
  • output-format=text selects the text format, default is html, all formats can be generated too (all).
  • executable and run-commands set the program execution options.
  • with-POP includes additional information.
  • lprof-params allows to pass extra parameters to the profiler tool, called by oneview.
    Although some lprof parameters are directly settable via oneview, it is better to add them here as it offers all lprof parameters and are better documented (note the space between parameters, this is incorrectly documented). The lprof parameters set here are useful to obtain accurate metrics, but because they introduce more overhead (slows down execution), specifying a pro ling start condition (and eventually stop) provides a nice tradeoff. The pro ling start condition is the presence on stdout of the word “parsed”, which is actually output by the program.
  • dbg=2 this is recommended because it provides more information on what is actually run (necessary to validate the parameters set are actually applied), and lists example commands to later access specific parts of the MAQAO report.

The MAQAO report is quite long. Starting from the top, section 2.1 – Stylizer already provides useful recommendations, here are the most interesting items:

[0 / 3] Some functions are compiled with a low optimization level (O0 or O1)
To have better performances, it is advised to help the compiler by using a proper optimization level (-O2 of higher). Warning, depending on compilers, faster optimization levels can decrease numeric accuracy.

[2.4 / 3] Most of time spent in analyzed modules comes from functions compiled with -g
-g option gives access to debugging information, such are source locations. Add -fno-omit-frame-pointer to improve the accuracy of callchains found during the application profiling

[0 / 3] Compilation of some functions is not optimized for the target processor
-march=x86-64 option is used but it is not specific enough to produce efficient code.
Architecture specific options are needed to produce efficient code for a specific processor ( -x(target) or -ax(target) ).

In the next section we will apply those recommendations.

MAQAO run 2

Applying previous recommendations

The recommendations are simple to apply:

$ gcc -O3 -g -fno-omit-frame-pointer -std=gnu11 -march=native search.c -o search-O3.out

We run MAQAO again, setting a different output directory and log file:

$ maqao oneview \
> –create-report=one \
> -xp=report-3g \
> –output-format=text \
> -executable=./search-O3.out \
> –run-command=”<executable> ALL.txt 5″ \
> –with-POP \
> –lprof-params=”–sampling-rate=high –stdout-start-keywords=parsed” \
> -dbg=2 \
> | tee search-O3.log

All the previous problems reported in section 2.1 - Stylizer are now fixed.
Let’s observe the results of these changes on an execution of our sequential search application (outside the MAQAO tool):

$ ./search-O3.out ALL.txt 5
read 3472KiB (4096KiB allocated) from ALL.txt
ALL.txt parsed, contains 675391 words (capacity 1777798 words)
0 THE 39331

7 ON 3673
8 DUTY 72
9 CIVIL 27
counted 675391 words, score=2828 attempts
naive duration_sec 8.31
0 EBOOKS 28

7 OUR 1157
8 SUBSCRIBE 4
9 HOW 971
counted 675391 words, score=1845 attempts
heur duration_sec 7.82

The recommendations translate to a 14% improvement.

 

Looking for new recommendations

Let us now inspect the new MAQAO report for additional improvement suggestions.

Section 1.3 – Potential Speedups does not report anything.

Section 2.1 – Stylizer (idem).

Section 2.2 – Strategizer (idem).

Section 2.3 – Optimizer mentions a few interesting recommendations, but unfortunately they point to loop IDs. To map these ids to the source code we must refer to a table in 5.1 – Top 10 Loops,
reproduced below:

Loop ID Module Source Location Coverage (%)
18 search-O3.out search.c:41–42 2.62
15 search-O3.out search.c:76–77 1.92
12 search-O3.out search.c:74–80
search.c:84–86
search.c:90–90
0.03

 

Referring back to the source code, we observe that:

  • Loop id 18 is the innermost loop of the naive function.
  • Loop id 15 is the innermost loop of the heur function.
  • Loop id 12 includes the memmove call, and the append operation. It has a relatively small impact however.

The interesting recommendations for loop IDs 18 and 15 are:

  • Presence of (function) calls.
    Looking at section 4.1 – Top 10 Functions we see that one function is dominating usage: __strcmp_avx2 with 89.47% of the time. This obviously is our word comparison function. Although we could try to accelerate this, for the purpose of this How-To guide, we will consider it as out of scope (otherwise we could switch the libc standard implementation with a custom string comparison function, or switch to another strncmp implementation).
  • Presence of constant non unit stride data access.
    This relates to our use of arrays instead of pointers which introduce an additional indirection. As mentioned in the presentation of the sequential search, we traded the disadvantage of arrays for the simplicity and the fewer memory allocations (a system call) by using a memory pool. So, not a relevant option in our case.

So, no clear improvement is proposed so far.Looking further at the static code analysis (which MAQAO refers to as CQA), which is also presented by loop ID, the following recommendations are proposed:

  •  Try another compiler or update/tune your current one: recompile with fassociative-math (included in Ofast or ffast-math) to extend loop vectorization to FP reductions.
  • Remove inter-iterations dependences from your loop and make it unit-stride:

As we do not use any floating point operations, and have settled for array indirections, there is still nothing actionable.Reading forward, we do notice the following recommendations for loop id 12:

  • memmove@plt: 1 occurrences<<list_path_3_call_1>>
  • Detected COMPLEX INSTRUCTIONS

Although this relates to a loop with a low coverage, it is actionable, as we will see next.

MAQAO run 3

Applying previous recommendations

To improve on the performance of loop id 12, we aim to reduce the calls to memmove, while keeping the move to top heuristic.

A possible change is to keep the array of word counts as in function heur, but to start the search from the end of the array, instead of the beginning. This way, adding a new word simply means lling in the next available slot (provided by the memory pool), and updating the top of list, thus eliminating one memmove call for each new word encountered.

This change is implemented in the new function heurinv (for heuristic inversed) is presented below (line 133 highlights the change):

126 static struct result3 heurinv(char **word, int n) { 
127     int page = 1024; 
128     struct record *rec = malloc(page * sizeof(*rec)); 
129     int i, top=-1, score = 0; 
130     struct record r; 
131     for (int iw=0; iw<n; ++iw) { 
132         // the basic sequential search: 
133         for (i=top; i>=0; i--) { 
134             if (strcmp(word[iw], rec[i].word) == 0) { 
135                 rec[i].count++; 
136                 score += top-i; 
137                 if (i == top) 
138                     break; 
139                 // move to top: 
140                 r = rec[i]; 
141                 memmove(rec+i, rec+i+1, (top-i)*sizeof(struct record)); 
142                 rec[top] = r; 
143                 break; 
144             } 
145         } 
146         // word not found, add it to our list 
147         if (i<0) { 
148             // last record, extend storage: 
149             if (top >= page-1) { 
150                 page *= 2; 
151                 rec = realloc(rec, page*sizeof(*rec)); 
152             } 
153             // place on top: 
154             top++; 
155             rec[top] = (struct record) {word[iw], 1}; 
156             score += top; 
157         } 
158     } 
159     return (struct result3) {top, rec, score/n}; 
160 }

Adding our new heuristic function to the previous two functions in the application, yields the following execution output:

$ ./search-O3.out ALL.txt 5
read 3472KiB (4096KiB allocated) from ALL.txt
ALL.txt parsed, contains 675391 words (capacity 1777798 words)
0 THE 39331

7 ON 3673
8 DUTY 72
9 CIVIL 27
counted 675391 words, score=2828 attempts
naive duration_sec 8.31
0 EBOOKS 28

7 OUR 1157
8 SUBSCRIBE 4
9 HOW 971
counted 675391 words, score=1845 attempts
heur duration_sec 7.82
28264 EBOOKS 28

28257 OUR 1157
28256 SUBSCRIBE 4
28255 HOW 971
counted 675391 words, score=1845 attempts
heurinv duration_sec 7.67

 

The new heuristic heurinv is faster than the previous heur, and so even more than the original naive search.

The MAQAO guided change translates to a 16% improvement.

This reflects the low coverage reported by MAQAO.

This concludes our practical use case with MAQAO

Additional remarks

There are other interesting MAQAO features, but to keep this How-To guide short, we will just mention them now.

The MAQAO runs presented in this How-To guide were fast enough to be executed interactively, however MAQAO does support the use of SLURM’s sbatch.

MAQAO also supports the comparison of different reports. In this How-To guide, we could have used this feature to compare the search functions heurinv to heur.

Finally, since MAQAO works on binary executables, other source languages could have been benchmarked from their executables, such as Rust, Golang or Ocaml (compiled). This feature of MAQAO is also helpful for compiler developers: different optimizations can be analyzed from the binary file generated.

Conclusion

Hopefully, this short and practical use case was beneficial and introduced the following:

  • MAQAO is very straightforward to use in your performance tuning activity (after choosing the minimum, useful parameters)
  • even simple approaches such as the sequential search can be insightful!
  • Knuth’s Art of Computer Programming is worth opening every once in a while : )

Appendix

Producing input data files

To simplify the loading of the input data (relying on libc’s getline and strsep), we must convert a raw text le (such as downloaded from project Gutenberg) to a le with a single line of space separated, upper-case words. The following command can perform this:
cat thoreau.txt | tr -s -c [:alnum:] " " | tr a-z A-Z > THOREAU.txt

 

Move to top

The heuristic introduced was quali ed as appropriate for many real-world data sets. Move to top can be seen as yet another instance of a general heuristic: what happens next is often what happened recently. Another instance is the Lindy effect.
One caveat, as N. Taleb observed, this heuristic applies well to natural, physical phenonema. It may not apply to artificial, information world.
Finally, G. Gigerenzer has studied and published many results on heuristics.

 

Full source code (search.c)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <stdbool.h>
// —— basic sequential search ——
// store word counts in an array
// add unseen words at the end
struct record {
    char *word;
    int count;
};
struct result {
    int end;
    struct record *rec;
    int score;
};
static void show_result(struct result *r, int n) {
    int total = 0;
    for (int i=0; i <= r->end; ++i) {
        total += r->rec[i].count;
    }
    for (int i=0; i <= r->end && i<n; i++) {
        printf(“%d %s %d\n”, i, r->rec[i].word, r->rec[i].count);
    }
    printf(“counted %d words, score=%d attempts\n”, total, r->score);
}
static struct result naive(char **word, int n) {
    int page = 1024;
    struct record *rec = malloc(page * sizeof(*rec));
    int i, end=-1, score = 0;
    for (int iw=0; iw<n; ++iw) {
// the basic sequential search:
        for (i=0; i<=end; i++) {
            if (strcmp(word[iw], rec[i].word) == 0) {
                rec[i].count++;
                score += i;
                break;
            }
        }
// word not found, add it to our list
        if (i > end) {
// last record, extend storage:
            if (end >= page-1) {
                page *= 2;
                rec = realloc(rec, page*sizeof(*rec));
            }
            end++;
            rec[end] = (struct record) {word[iw], 1};
            score += end;
        }
    }
    return (struct result) {end, rec, score/n};
}
// —— heuristic sequential search ——
// move to top
// store word counts in an array
// move hits to the top
// add new, unseen, words at the top
static struct result heur(char **word, int n) {
    int page = 1024;
    struct record *rec = malloc(page * sizeof(*rec));
    int i=0, end=-1, score = 0;
    struct record r;
    for (int iw=0; iw<n; ++iw) {
// the basic sequential search:
        for (i=0; i <= end; i++) {
            if (strcmp(word[iw], rec[i].word) == 0) {
                rec[i].count++;
                score += i;
                if (i==0)
break;
// move to top:
                r = rec[i];
                memmove(rec+1, rec, i*sizeof(struct record));
                rec[0] = r;
                break;
            }
        }
// word not found, add it to our list
        if (i > end) {
// last record, extend storage:
            if (end >= page-1) {
                page *= 2;
                rec = realloc(rec, page*sizeof(*rec));
            }
// move to top:
            memmove(rec+1, rec, i*sizeof(struct record));
            rec[0] = (struct record) {word[iw], 1};
            end++;
            score += end;
        }
    }
    return (struct result) {end, rec, score/n};
}

// —— heuristic sequential search ——

// As ‘heur’ but store in stack (inverse order)

struct result3 {
    int top;
    struct record *rec;
    int score;
};
static void show_result3(struct result3 *r, int n) {
    int total = 0;
    for (int i=r->top; i>=0; i–) {
        total += r->rec[i].count;
    }
    for (int i=r->top, j=0; i>=0 && j<n; i–, j++) {
        printf(“%d %s %d\n”, i, r->rec[i].word, r->rec[i].count);
    }
    printf(“counted %d words, score=%d attempts\n”, total, r->score);
}
static struct result3 heurinv(char **word, int n) {
    int page = 1024;
    struct record *rec = malloc(page * sizeof(*rec));
    int i, top=-1, score = 0;
    struct record r;
    for (int iw=0; iw<n; ++iw) {
// the basic sequential search:
        for (i=top; i>=0; i–) {
            if (strcmp(word[iw], rec[i].word) == 0) {
                rec[i].count++;
                score += top-i;
                if (i == top)
break;
// move to top:
                r = rec[i];
                memmove(rec+i, rec+i+1, (top-i)*sizeof(struct record));
                rec[top] = r;
                break;
            }
        }
// word not found, add it to our list
        if (i<0) {
// last record, extend storage:
            if (top >= page-1) {
                page *= 2;
                rec = realloc(rec, page*sizeof(*rec));
            }
// place on top:
            top++;
            rec[top] = (struct record) {word[iw], 1};
            score += top;
        }
    }
    return (struct result3) {top, rec, score/n};
}
int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, “missing text file argument, optional iterations\n”);
        exit(1);
    }
    int iterations = (argc == 3 ? atoi(argv[2]) : 1);
    if (iterations <= 0) {
        fprintf(stderr, “invalid iterations, expecting integer\n”);
        exit(1);
    }
    FILE *fs = fopen(argv[1], “r”);
    if (!fs) {
        fprintf(stderr, “couldn’t open %s: %s\n”, argv[1], strerror(errno));
        exit(2);
    }
    char *line0 = NULL;
    size_t allocd = 0;
    ssize_t len = getline(&line0, &allocd, fs);
    printf(“read %ldKiB (%luKiB allocated) from %s\n”, len/1024, allocd/1024, argv[1]);
    if (len == -1) {
        fprintf(stderr, “couldn’t read line %s\n”, strerror(errno));
        exit(3);
    }
    fclose(fs);
    size_t nwords = len/2;
    char **word = calloc(nwords, sizeof(*word));
    int i=0;
// keep the original pointer, before strsep modifies it:
    char *line = line0;
    for (char *token; (token = strsep(&line, ” “)) && i < nwords;) {
        if (!*token || *token==’\n’) continue;
        word[i++] = token;
    }
    printf(“%s parsed, contains %d words (capacity %lu words)\n”, argv[1], i, nwords);

// run sequential search functions multiple times

    struct timespec tv0, tv1;
    struct result count[iterations];
    clock_gettime(CLOCK_MONOTONIC, &tv0);
    for (int j=0; j<iterations; j++) {
        count[j] = naive(word, i);
    }
    clock_gettime(CLOCK_MONOTONIC, &tv1);
    show_result(count, 10);
    for (int j=0; j<iterations; j++) {
        free(count[j].rec);
    }
    fprintf(stderr, “naive duration_sec %.2f\n”,
            (tv1.tv_sec – tv0.tv_sec + (double)(tv1.tv_nsec – tv0.tv_nsec)/1000000000.0)/iterations);
    clock_gettime(CLOCK_MONOTONIC, &tv0);
    for (int j=0; j<iterations; j++) {
        count[j] = heur(word, i);
    }
    clock_gettime(CLOCK_MONOTONIC, &tv1);
    show_result(count, 10);
    for (int j=0; j<iterations; j++) {
        free(count[j].rec);
    }
    fprintf(stderr, “heur duration_sec %.2f\n”,
            (tv1.tv_sec – tv0.tv_sec + (double)(tv1.tv_nsec – tv0.tv_nsec)/1000000000.0)/iterations);
    struct result3 count3[iterations];
    clock_gettime(CLOCK_MONOTONIC, &tv0);
    for (int j=0; j<iterations; j++) {
        count3[j] = heurinv(word, i);
    }
    clock_gettime(CLOCK_MONOTONIC, &tv1);
    show_result3(count3, 10);
    for (int j=0; j<iterations; j++) {
        free(count3[j].rec);
    }
    fprintf(stderr, “heurinv duration_sec %.2f\n”,
            (tv1.tv_sec – tv0.tv_sec + (double)(tv1.tv_nsec – tv0.tv_nsec)/1000000000.0)/iterations);
    free(line0);
    free(word);
    exit(0);
}
</code></pre>