Getting the AST and IR from clang/LLVM

clang is a frontend compiler for the LLVM toolchain. The basic idea is that LLVM operates at the Intermediate Representation (IR) produced by a frontend, e.g., clang, and is therefore responsible of the low level code compilation. When a compilation process takes place, usually the source code is transformed in an AST Abstract Syntax Tree that represent a tree-based form of the source code, with all the possible branches and nodes (e.g., variable declarations, assignments, function calls and so on). This form allows an automated processing that can udnerstand how to build the next level down in the compilation phase. clang allows the dumping of the AST in a semi-human readable format. Let’s start with a very simple C code:

#include <stdio.h>

int
main( void )
{
  printf( "Hello World!\n" );
}



The above is the very basi “Hello World” program. Using the options --ast-dump and Xclang it is possible to get a textual representation of what the AST will look like:

% clang -Xclang -ast-dump test.c


The tree, even for a program as short as “Hello World” is quite huge, and that is due to the loading of all the symbols in stdio.h. The end of the tree, however, depicts the program itself:

-FunctionDecl 0x61e36cd03548 <test.c:5:1, line:9:1> line:6:1 main 'int (void)'
  `-CompoundStmt 0x61e36cd03730 <line:7:1, line:9:1>
    `-CallExpr 0x61e36cd036d8 <line:8:3, col:28> 'int'
      |-ImplicitCastExpr 0x61e36cd036c0 <col:3> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
      | `-DeclRefExpr 0x61e36cd035f0 <col:3> 'int (const char *, ...)' Function 0x61e36ccc3c98 'printf' 'int (const char *, ...)'
      `-ImplicitCastExpr 0x61e36cd03718 <col:11> 'const char *' <NoOp>
        `-ImplicitCastExpr 0x61e36cd03700 <col:11> 'char *' <ArrayToPointerDecay>
          `-StringLiteral 0x61e36cd03650 <col:11> 'char[14]' lvalue "Hello World!\n"



The tree shows the definition of main, that has a few subnodes including the call to printf that accepts a const char * (implicit cast) applied to a char * that is the constant literal "Hello World!\n". Let’s change the program a little, to do a simple loop and variable interpolation:

#include <stdio.h>


int
main( void )
{
  for ( int i = 0; i < 3; i++ )
    printf( "%d) Hello World!\n", i );

  return -1;
}



In this case the tree gets even larger:

FunctionDecl 0x5b4b352395f8 <test.c:5:1, line:12:1> line:6:1 main 'int (void)'
  `-CompoundStmt 0x5b4b35239a20 <line:7:1, line:12:1>
    |-ForStmt 0x5b4b352399a0 <line:8:3, line:9:37>
    | |-DeclStmt 0x5b4b35239740 <line:8:9, col:18>
    | | `-VarDecl 0x5b4b352396b8 <col:9, col:17> col:13 used i 'int' cinit
    | |   `-IntegerLiteral 0x5b4b35239720 <col:17> 'int' 0
    | |-<<<NULL>>>
    | |-BinaryOperator 0x5b4b352397b0 <col:20, col:24> 'int' '<'
    | | |-ImplicitCastExpr 0x5b4b35239798 <col:20> 'int' <LValueToRValue>
    | | | `-DeclRefExpr 0x5b4b35239758 <col:20> 'int' lvalue Var 0x5b4b352396b8 'i' 'int'
    | | `-IntegerLiteral 0x5b4b35239778 <col:24> 'int' 3
    | |-UnaryOperator 0x5b4b352397f0 <col:27, col:28> 'int' postfix '++'
    | | `-DeclRefExpr 0x5b4b352397d0 <col:27> 'int' lvalue Var 0x5b4b352396b8 'i' 'int'
    | `-CallExpr 0x5b4b35239928 <line:9:5, col:37> 'int'
    |   |-ImplicitCastExpr 0x5b4b35239910 <col:5> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
    |   | `-DeclRefExpr 0x5b4b35239808 <col:5> 'int (const char *, ...)' Function 0x5b4b351f9d48 'printf' 'int (const char *, ...)'
    |   |-ImplicitCastExpr 0x5b4b35239970 <col:13> 'const char *' <NoOp>
    |   | `-ImplicitCastExpr 0x5b4b35239958 <col:13> 'char *' <ArrayToPointerDecay>
    |   |   `-StringLiteral 0x5b4b35239870 <col:13> 'char[18]' lvalue "%d) Hello World!\n"
    |   `-ImplicitCastExpr 0x5b4b35239988 <col:35> 'int' <LValueToRValue>
    |     `-DeclRefExpr 0x5b4b352398a0 <col:35> 'int' lvalue Var 0x5b4b352396b8 'i' 'int'
    `-ReturnStmt 0x5b4b35239a10 <line:11:3, col:11>
      `-UnaryOperator 0x5b4b352399f8 <col:10, col:11> 'int' prefix '-'
        `-IntegerLiteral 0x5b4b352399d8 <col:11> 'int' 1




Note the nodes required to define and initialize the variable i, that are DeclStmt, VarDecl, IntegerLiteral. The for loop has a main ForStmt that includes the above nodes, the BinaryOperator (and subnodes) used to test the condition and the UnaryOperator (and subnodes) used to increment the variable. Note also the ReturnStmt that is made by two different nodes, one for the literal value and one to apply the minus sign. It is quite easy to understand how long and deep can quickly become a AST for a real world application. Just to get a raw size: the above C program has an AST around 888 lines out of 13 lines of code (including empty lines), while the pgagroal/src/cli.c file has an AST of 32338 lines out of 1159 lines of code (including comments)!

Getting a human readable IR from LLVM

The LLVM toolchain accepts a “human readable” IR (let’s call an HIR - High level IR) or a binary IR (let’s call it LIR - Low level IR). Given a source file, it is possible to ask LLVM to dump a human readable representation of the IR:

% clang -S -emit-llvm test.c


The above command will produce the file test.ll that is 51 lines out of the 13 original lines of C code:

; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@.str = private unnamed_addr constant [18 x i8] c"%d) Hello World!\0A\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  store i32 0, ptr %1, align 4
  store i32 0, ptr %2, align 4
  br label %3

3:                                                ; preds = %9, %0
  %4 = load i32, ptr %2, align 4
  %5 = icmp slt i32 %4, 3
  br i1 %5, label %6, label %12

6:                                                ; preds = %3
  %7 = load i32, ptr %2, align 4
  %8 = call i32 (ptr, ...) @printf(ptr noundef @.str, i32 noundef %7)
  br label %9

9:                                                ; preds = %6
  %10 = load i32, ptr %2, align 4
  %11 = add nsw i32 %10, 1
  store i32 %11, ptr %2, align 4
  br label %3, !llvm.loop !6

12:                                               ; preds = %3
  ret i32 -1
}

declare i32 @printf(ptr noundef, ...) #1

attributes #0 = { noinline nounwind optnone uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 8, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"Ubuntu clang version 18.1.3 (1)"}
!6 = distinct !{!6, !7}
!7 = !{!"llvm.loop.mustprogress"}



The (H)IR above represents the stack based machine instructions that LLVM is going to compile further. The language is detailed in the documentation and can be used as a possible intermediate language to translate a different programming language to use LLVM as backend.

The article Getting the AST and IR from clang/LLVM has been posted by Luca Ferrari on June 20, 2024