What is srcML?

The srcML format is an XML representation for source code, where the markup tags identify elements of the abstract syntax for the language. The srcml program is a command-line application for converting source code to srcML, for exploring, analyzing, and manipulating source code in this form, and for converting srcML back to source code. The current parsing technology supports C, C++, C#, and Java.

srcML Layer Diagram


Both provided and custom-built tools are used to query, extract data from, and transform source code.


External models of the code, such as PDGs, UML diagrams, and call graphs, can be built in XML.


The full range of XML technologies can be applied to the srcML format.


The srcml CLI converts entire projects between source code and the srcML format. Supported languages include C, C++, Java, and C#.

srcML Format

The srcML format represents source code with all original information intact, including whitespace, comments, and preprocessing statements.

Using srcML

A number of underlying features make srcML particularly useful for evolution and maintenance. The main philosophy is to take a programmer-centric view of the code rather than a compiler-centric one. First, the conversion from source code to srcML is lossless. That is, no formatting, comments, or actual code is lost. There is a round-trip equivalency from source code to srcML and back to the original source code. Additionally, macros, templates, and preprocessor statements are marked up. That is, the preprocessor is not run (or need not be run) prior to conversion to srcML. This also implies that code with missing includes, libraries, or code fragments can be converted to well-formed srcML. Lastly, the conversion to srcML is extremely efficient, running faster than a compiler.

Here is a simple C++ program, rotate.cpp:

#include "rotate.h"

// rotate three values
void rotate(int& n1, int& n2, int& n3) {

  // copy original values
  int tn1 = n1, tn2 = n2, tn3 = n3;

  // move
  n1 = tn3;
  n2 = tn1;
  n3 = tn2;
}

and the corresponding srcML version, rotate.xml:

<cpp:include>#<cpp:directive>include</cpp:directive> <cpp:file>"rotate.h"</cpp:file></cpp:include>

<comment type="line">// rotate three values</comment>
<function><type><name>void</name></type> <name>rotate</name><parameter_list>(<parameter><decl><type><name>int</name><modifier>&amp;</modifier></type> <name>n1</name></decl></parameter>, <parameter><decl><type><name>int</name><modifier>&amp;</modifier></type> <name>n2</name></decl></parameter>, <parameter><decl><type><name>int</name><modifier>&amp;</modifier></type> <name>n3</name></decl></parameter>)</parameter_list> <block>{<block_content>

  <comment type="line">// copy original values</comment>
  <decl_stmt><decl><type><name>int</name></type> <name>tn1</name> <init>= <expr><name>n1</name></expr></init></decl>, <decl><type ref="prev"/><name>tn2</name> <init>= <expr><name>n2</name></expr></init></decl>, <decl><type ref="prev"/><name>tn3</name> <init>= <expr><name>n3</name></expr></init></decl>;</decl_stmt>

  <comment type="line">// move</comment>
  <expr_stmt><expr><name>n1</name> <operator>=</operator> <name>tn3</name></expr>;</expr_stmt>
  <expr_stmt><expr><name>n2</name> <operator>=</operator> <name>tn1</name></expr>;</expr_stmt>
  <expr_stmt><expr><name>n3</name> <operator>=</operator> <name>tn2</name></expr>;</expr_stmt>
</block_content>}</block></function>
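
Because the markup only wraps the original text, the source can be recovered by discarding the tags and keeping every text node in document order. The following is a minimal sketch of that idea in Python using only the standard library, not a description of how the srcml tool itself performs the conversion. The enclosing unit element, its namespace URI, and the helper name srcml_to_source are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated srcML for one comment and one assignment from rotate().
# Assumption: the <unit> wrapper and namespace URI are supplied here for
# illustration; they are not part of the listing above.
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="C++">'
    '<comment type="line">// move</comment>\n'
    '<expr_stmt><expr><name>n1</name> <operator>=</operator> '
    '<name>tn3</name></expr>;</expr_stmt>\n'
    '</unit>'
)


def srcml_to_source(srcml_text: str) -> str:
    """Recover source text by concatenating all text nodes in document order."""
    root = ET.fromstring(srcml_text)
    return "".join(root.itertext())


source = srcml_to_source(SRCML)
print(source)
```

Whitespace, the comment, and the statement terminator all survive, since each of them is stored as ordinary text between the tags.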

The srcml tool efficiently converts source code files into the srcML format, with a translation speed of 25 KLOC/sec and approximately 3,000 files/minute. For example, the entire Linux kernel can be converted into the srcML format in less than seven minutes. The tool is robust in that it handles unpreprocessed and incomplete code. Once the code is in srcML, XML tools and technologies can be used for tasks such as fact extraction and transformation. These include XPath and XQuery for fact extraction, RELAX NG and XML Schema for validation, and XSLT, DOM, and SAX for transformation. The srcml tool also handles the translation from srcML back to source code, at speeds over 250 KLOC/sec.
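
As an illustration of fact extraction, the sketch below runs two path queries over a shortened version of the rotate function using the limited XPath subset in Python's standard xml.etree.ElementTree module. The unit wrapper, the namespace URI bound to the src prefix, and the variable names are assumptions for this example; a full XPath or XQuery engine would accept richer expressions.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI for srcML elements, bound here to "src".
NS = {"src": "http://www.srcML.org/srcML/src"}

# Shortened srcML for rotate(): two parameters and an empty body.
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="C++">'
    '<function><type><name>void</name></type> <name>rotate</name>'
    '<parameter_list>('
    '<parameter><decl><type><name>int</name><modifier>&amp;</modifier></type>'
    ' <name>n1</name></decl></parameter>, '
    '<parameter><decl><type><name>int</name><modifier>&amp;</modifier></type>'
    ' <name>n2</name></decl></parameter>'
    ')</parameter_list> <block>{<block_content>'
    '</block_content>}</block></function>'
    '</unit>'
)

root = ET.fromstring(SRCML)

# The function name is the <name> that is a direct child of <function>;
# the return type's <name> is nested inside <type> and is not matched.
function_names = [n.text for n in root.findall(".//src:function/src:name", NS)]

# Parameter names are the <name> children of each parameter's <decl>.
parameter_names = [
    n.text for n in root.findall(".//src:parameter/src:decl/src:name", NS)
]

print(function_names, parameter_names)
```

The queries rely only on element nesting, so they work the same way on code that does not compile or that has missing includes.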

srcML has been used for a variety of maintenance problems. These include, but are not limited to, analyzing large systems to automatically reverse engineer class and method stereotypes, supporting syntactic differencing, and applying transformations to support API and compiler migration.
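
A very small transformation can be sketched in the same style: rename an identifier by editing the matching name elements, then read the text back out as source. This is only an illustrative example under stated assumptions (the unit wrapper, the namespace URI, and the hypothetical helper rename_identifier); the migration work mentioned above involves far more analysis than a textual rename.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI used for srcML elements.
SRC_NS = "http://www.srcML.org/srcML/src"

# One assignment statement from rotate(), wrapped in an assumed <unit>.
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="C++">'
    '<expr_stmt><expr><name>n2</name> <operator>=</operator> '
    '<name>tn1</name></expr>;</expr_stmt>'
    '</unit>'
)


def rename_identifier(srcml_text: str, old: str, new: str) -> str:
    """Rename every simple <name> equal to `old`, then return the source text."""
    root = ET.fromstring(srcml_text)
    for name in root.iter(f"{{{SRC_NS}}}name"):
        if name.text == old:
            name.text = new
    # Dropping the markup yields the transformed source code.
    return "".join(root.itertext())


renamed = rename_identifier(SRCML, "tn1", "saved_n1")
print(renamed)
```

Because only name elements are touched, comments and string literals that happen to contain the same characters are left alone, which is the advantage over a plain text search and replace.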

See the tutorials and documentation for more details on usage.