Creating srcML

Let’s start off by making a simple srcML unit. We have a C++ file called rotate.cpp. To transform this file to the srcML format, execute srcml on the command line with the path to the file.

$ srcml rotate.cpp

Because we have not specified an output location, srcml writes straight to standard output. You should see the contents of rotate.cpp marked up in the srcML format. If you used the contents of the file as shown in the provided link, you will see something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="" xmlns:cpp="" revision="0.9.5" language="C++" filename="rotate.cpp">
<cpp:include>#<cpp:directive>include</cpp:directive> <cpp:file>"rotate.h"</cpp:file></cpp:include>

<comment type="line">// rotate three values</comment>
<function><type><name>void</name></type> <name>rotate</name><parameter_list>(<parameter><decl><type><name>int</name><modifier>&amp;</modifier></type> <name>n1</name></decl></parameter>, <parameter><decl><type><name>int</name><modifier>&amp;</modifier></type> <name>n2</name></decl></parameter>, <parameter><decl><type><name>int</name><modifier>&amp;</modifier></type> <name>n3</name></decl></parameter>)</parameter_list>
  <comment type="line">// copy original values</comment>
  <decl_stmt><decl><type><name>int</name></type> <name>tn1</name> <init>= <expr><name>n1</name></expr></init></decl>, <decl><type ref="prev"/><name>tn2</name> <init>= <expr><name>n2</name></expr></init></decl>, <decl><type ref="prev"/><name>tn3</name> <init>= <expr><name>n3</name></expr></init></decl>;</decl_stmt>

  <comment type="line">// move</comment>
  <expr_stmt><expr><name>n1</name> <operator>=</operator> <name>tn3</name></expr>;</expr_stmt>
  <expr_stmt><expr><name>n2</name> <operator>=</operator> <name>tn1</name></expr>;</expr_stmt>
  <expr_stmt><expr><name>n3</name> <operator>=</operator> <name>tn2</name></expr>;</expr_stmt>

You may be wondering what happened to the original source-code. No worries, it’s all still there. If you were to remove all of the elements and attributes from this output, keeping only the text, you would see that it matches the original source-code exactly. The extra content shown here consists of srcML tags, which identify the different syntactic parts of the source-code. srcml detects that rotate.cpp was written in C++ from its cpp extension, and it uses its knowledge of C++ syntax to identify the various elements of the code.
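The claim that stripping the markup recovers the source can be checked with a few lines of Python. This is a minimal sketch using only the standard library, not part of srcml itself. The embedded snippet is a shortened stand-in for the output above, and the namespace URI is an assumption, since it is not shown in the listing.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the srcML output above. The namespace URI is an
# assumption; the listing in this tutorial shows it empty.
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="C++" filename="rotate.cpp">'
    "<expr_stmt><expr><name>n1</name> <operator>=</operator> <name>tn3</name></expr>;</expr_stmt>\n"
    "</unit>"
)


def strip_markup(srcml_text):
    """Join every text node of the document in order, dropping all tags and attributes."""
    root = ET.fromstring(srcml_text)
    return "".join(root.itertext())


recovered = strip_markup(SRCML)
print(recovered, end="")
```

Only the text nodes survive, so the result is the statement exactly as it would appear in the source file, whitespace included.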

For example, we can break the preprocessor statement #include "rotate.h" down into the include directive and the file that was included. These are captured in the tags <cpp:include>, <cpp:directive>, and <cpp:file>. Notice that <cpp:include> contains the entire include statement, while <cpp:directive> contains only include, and <cpp:file> contains only the file name. This hierarchical representation carries across all elements: all source-code within a block is contained within the <block> tags, all source-code within a function definition is contained within the <function> tags, all source-code within an expression is contained within the <expr> tags, and so on. This allows for exploration and manipulation on a syntactic and hierarchical level.
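To make the nesting concrete, the sketch below parses the include line with Python's standard xml.etree.ElementTree and prints what each of the three elements contains. Both namespace URIs are assumptions (the listing above does not show them), and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the listing in this tutorial shows them empty.
NS = {
    "src": "http://www.srcML.org/srcML/src",
    "cpp": "http://www.srcML.org/srcML/cpp",
}

SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" '
    'xmlns:cpp="http://www.srcML.org/srcML/cpp" language="C++" filename="rotate.cpp">'
    "<cpp:include>#<cpp:directive>include</cpp:directive> "
    '<cpp:file>"rotate.h"</cpp:file></cpp:include>\n'
    "</unit>"
)

root = ET.fromstring(SRCML)
include = root.find("cpp:include", NS)

# The outer element spans the whole statement; the inner ones hold only their part.
include_text = "".join(include.itertext())
directive_text = include.find("cpp:directive", NS).text
file_text = include.find("cpp:file", NS).text

print(include_text)
print(directive_text)
print(file_text)
```

The same pattern of finding an element and then reading the text beneath it applies to <block>, <function>, <expr>, and the other tags.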

You may also notice some extra attributes at the top of the srcML file denoting the name of the file that was parsed, its language, and a revision. In these examples, the alpha version 0.9.5 of srcml was used, which is why the revision attribute contains “0.9.5”. Note that your results may vary slightly if you are using a different version.

Outputting to a file can be accomplished with the -o option, or by redirecting standard output to a file. For example, the following command writes the resulting srcML to a file named rotate.xml. The xml extension matters because srcml relies on it later to recognize the file as srcML. All srcML is also valid XML.

$ srcml rotate.cpp -o rotate.xml

Now that we have a srcML file to work with, let’s transform it back to source code! This is easily accomplished the same way that we transformed it to srcML, by giving srcml the name of the file.

$ srcml rotate.xml

Similarly, an output file such as rotate.cpp can be specified with the output option. If none is given, srcml writes to standard output. Here, srcml knows to transform from the srcML format to source-code because of the xml extension on the input file.

The same commands apply for project directories, archives, and compressed files. Let’s give it a try with an example C++ project called narq, which has four files in it. Download the tar.gz for the project here, and run srcml on it with the following command:

$ srcml --verbose narq.tar.gz -o narq.tar.gz.xml

Because we ran it with the verbose option, we can see each of the files that were parsed along with some information about each file. This is helpful for verification, and as a progress indicator, when running srcml on a project directory or archive that contains many files. Let’s talk about the verbose output first, which looks like:

Source encoding:  (null)
XML encoding:  UTF-8
    - narq/Makefile
    1 narq/tools.hpp    C++ 32  49fb5024a2700e074451128b0e19ddb05af4245c
    2 narq/tools.cpp    C++ 84  e8e277111570f5180733833b0859e61cf423813e
    3 narq/main.cpp     C++ 89  3bbc0e1c47fe5b5c79843e34e0c2d6b35b56d5ab

Translated: 3   Skipped: 1  Error: 0    Total: 4

The verbose output gives a short summary for each file provided to srcml: its path, the programming language it was written in, the number of lines it contains, and a SHA-1 hash computed from the contents of the source-code file. At the end, we see that all but one of the files from our example project were parsed. The Makefile was skipped because srcml skips any file in a directory or archive that lacks a supported source-code file extension, such as XML, HTML, or Make files, as well as files written in programming languages that srcml does not support. The supported languages are C, C++, C#, and Java.

Now let’s look at narq.tar.gz.xml, where we output the srcML file. There’s a lot of information here, but to summarize you should see the output follow the format:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="" revision="0.9.5" url="narq.tar.gz">

<unit xmlns:cpp="" revision="0.9.5" language="C++" filename="narq/tools.hpp" hash="49fb5024a2700e074451128b0e19ddb05af4245c">

<unit xmlns:cpp="" revision="0.9.5" language="C++" filename="narq/tools.cpp" hash="e8e277111570f5180733833b0859e61cf423813e">

<unit xmlns:cpp="" revision="0.9.5" language="C++" filename="narq/main.cpp" hash="3bbc0e1c47fe5b5c79843e34e0c2d6b35b56d5ab">


Here, each nested unit contains srcML based on the content of that file. Notice that all of the files are condensed into one srcML archive. At the root unit, the src namespace appears, which is used to mark up all srcML files regardless of language. We also have a new attribute, url, which shows the name of the original tar.gz containing the source-code. As with the simple srcML unit we made when we ran srcml on one file, for each file parsed there exists a srcML unit with the namespaces used for the tags that mark up the source-code file, the revision of srcml used to create the unit, the language of the file, and the name of the original source-code file.
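Because the archive is ordinary XML, the per-file attributes can be read with any XML library. The sketch below uses Python's standard library to list the nested units. The namespace URIs and the placeholder unit bodies are assumptions made for illustration. The file names and hashes are the ones shown above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the listing in this tutorial shows it empty.
NS = {"src": "http://www.srcML.org/srcML/src"}
CPP = "http://www.srcML.org/srcML/cpp"  # also an assumption

# Same shape as the archive above; the unit bodies are placeholders.
ARCHIVE = f"""<unit xmlns="http://www.srcML.org/srcML/src" revision="0.9.5" url="narq.tar.gz">

<unit xmlns:cpp="{CPP}" revision="0.9.5" language="C++" filename="narq/tools.hpp" hash="49fb5024a2700e074451128b0e19ddb05af4245c"><comment type="line">// placeholder</comment>
</unit>

<unit xmlns:cpp="{CPP}" revision="0.9.5" language="C++" filename="narq/tools.cpp" hash="e8e277111570f5180733833b0859e61cf423813e"><comment type="line">// placeholder</comment>
</unit>

<unit xmlns:cpp="{CPP}" revision="0.9.5" language="C++" filename="narq/main.cpp" hash="3bbc0e1c47fe5b5c79843e34e0c2d6b35b56d5ab"><comment type="line">// placeholder</comment>
</unit>

</unit>"""


def list_units(archive_xml):
    """Return (filename, language, hash) for each unit directly under the root unit."""
    root = ET.fromstring(archive_xml)
    return [
        (unit.get("filename"), unit.get("language"), unit.get("hash"))
        for unit in root.findall("src:unit", NS)
    ]


units = list_units(ARCHIVE)
for filename, language, digest in units:
    print(filename, language, digest)
print(ET.fromstring(ARCHIVE).get("url"))
```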

In this example, we used a tar.gz file, but any tar, bz2, gz, zip, or cpio compression/archive input format is also supported. In addition, if we decompress narq.tar.gz to its original directory structure, we can give srcml the narq directory and expect the same output.

We can also go from this srcML archive format back to the original source-code. A major difference here is that there are multiple files, so we have two options: 1) convert just one of the units back to source-code, or 2) convert all of the units back to source-code in their original directory hierarchy.

With the first option, we provide srcml with a specific unit to convert back to source-code. This is done with the unit option, which takes a unit number to extract. For example,

$ srcml --unit 1 narq.tar.gz.xml

will extract the first unit in the srcML archive and convert it back to source-code, writing to standard output. This is most useful when we only want one of the files or for testing the result of a transformation.
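What the unit option does can be imitated on a small scale: pick the n-th nested unit, counting from 1, and keep only its text. The following Python sketch illustrates that idea under stated assumptions and is not how srcml is implemented. The two-file archive, its file names, and the namespace URI are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the listings in this tutorial show it empty.
NS = {"src": "http://www.srcML.org/srcML/src"}

# Hypothetical two-file archive used only for this example.
ARCHIVE = (
    '<unit xmlns="http://www.srcML.org/srcML/src" url="demo.tar.gz">\n\n'
    '<unit language="C++" filename="demo/a.hpp"><decl_stmt><decl><type><name>int</name></type>'
    " <name>a</name></decl>;</decl_stmt>\n</unit>\n\n"
    '<unit language="C++" filename="demo/b.cpp"><decl_stmt><decl><type><name>int</name></type>'
    " <name>b</name></decl>;</decl_stmt>\n</unit>\n\n"
    "</unit>"
)


def extract_unit(archive_xml, number):
    """Return the source text of the unit at the given 1-based position."""
    units = ET.fromstring(archive_xml).findall("src:unit", NS)
    if not 1 <= number <= len(units):
        raise IndexError(f"archive has {len(units)} units, asked for {number}")
    return "".join(units[number - 1].itertext())


first = extract_unit(ARCHIVE, 1)
second = extract_unit(ARCHIVE, 2)
print(first, end="")
print(second, end="")
```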

For the second option, we instruct srcml to extract all of the source-code files by recreating the original directory structure. This is accomplished with the to-dir option, which takes a path where the original directory structure will be recreated. This structure is based on the filename attribute for each of the srcML units. This means that files that have been skipped by srcml are not recreated, and srcML units whose filename attribute has been modified by some transformation will be recreated under the new filename.
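The way the filename attribute drives the recreated layout can be sketched in Python. This is an illustration under stated assumptions, not srcml's implementation. The archive, its file names, and the namespace URI are hypothetical. The function refuses a missing target directory to mirror the behaviour this tutorial describes for the to-dir option.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed namespace URI; the listings in this tutorial show it empty.
NS = {"src": "http://www.srcML.org/srcML/src"}

# Hypothetical two-file archive used only for this example.
ARCHIVE = (
    '<unit xmlns="http://www.srcML.org/srcML/src" url="demo.tar.gz">\n\n'
    '<unit language="C++" filename="demo/a.hpp"><decl_stmt><decl><type><name>int</name></type>'
    " <name>a</name></decl>;</decl_stmt>\n</unit>\n\n"
    '<unit language="C++" filename="demo/b.cpp"><decl_stmt><decl><type><name>int</name></type>'
    " <name>b</name></decl>;</decl_stmt>\n</unit>\n\n"
    "</unit>"
)


def extract_all(archive_xml, target):
    """Write each unit's text to target/<filename>; return the relative paths written."""
    target = Path(target)
    if not target.is_dir():
        # The target itself is never created, only the directories below it.
        raise FileNotFoundError(f"target directory does not exist: {target}")
    written = []
    for unit in ET.fromstring(archive_xml).findall("src:unit", NS):
        destination = target / unit.get("filename")
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_text("".join(unit.itertext()), encoding="utf-8")
        written.append(destination.relative_to(target).as_posix())
    return written


with tempfile.TemporaryDirectory() as tmp:
    created = extract_all(ARCHIVE, tmp)
    restored = (Path(tmp) / "demo" / "a.hpp").read_text(encoding="utf-8")

print(created)
```

Because the output paths come only from the filename attributes, anything that never became a unit is never written, and a unit whose filename was changed lands at the new path.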

To demonstrate, let us extract the source-code from narq.tar.gz.xml with the following command:

$ srcml --to-dir . narq.tar.gz.xml

Here we’re instructing srcml to extract the source-code contents of the archive to the current directory. srcml creates the project's own directories as needed, but it will not create the target directory itself, so the path given to the to-dir option must already exist. As a result of this operation, the current directory will contain a narq directory holding tools.hpp, tools.cpp, and main.cpp, whose contents match the original source-code. The skipped Makefile is not recreated.


Now that you can create srcML and extract it back to source, let’s cover some queries and manipulations that can be performed on srcML.