XPath Queries

The srcML format lets you run syntactic and hierarchical searches on source code using XPath. This is done with the --xpath option, whose argument is the XPath query to run. For example, let’s do a simple search to find the names of all function definitions in the example project narq we used above.

$ srcml --xpath "//src:function/src:name" narq.xml

The above command runs an XPath query on the srcML file narq.xml. Specifically, it searches for every name element that is an immediate child of a function element, where the function element may appear anywhere within the hierarchy of a srcML unit. To break it down, //src:function finds all of the src:function elements in each unit, and /src:name then selects the src:name elements that are children of those src:function elements.

The result is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src" revision="1.0.0">

<unit revision="1.0.0" language="C++" filename="narq/main.cpp" item="1"><name>main</name></unit>

<unit revision="1.0.0" language="C++" filename="narq/tools.cpp" item="1"><name>rabinKarpMulti</name></unit>

<unit revision="1.0.0" language="C++" filename="narq/tools.cpp" item="2"><name>rhash</name></unit>

<unit revision="1.0.0" language="C++" filename="narq/tools.cpp" item="3"><name>partition</name></unit>

</unit>

This shows that there are four function definitions, named main, rabinKarpMulti, rhash, and partition. The results use the same format as a srcML archive: a root unit contains all of the results, and each query result is enclosed in its own unit element, whose attributes include the file in which the result was found.
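
Because the result is ordinary XML, it can also be read outside of srcml. The following is a minimal sketch, assuming the output shown above has been captured as a string (the XML declaration is left out, and the names RESULT and list_results are illustrative only). It uses Python’s standard xml.etree.ElementTree module to pair each result with the file it came from.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the root unit of the query result above.
NS = {"src": "http://www.srcML.org/srcML/src"}

# The query result from above, pasted in as a string (XML declaration omitted).
RESULT = """<unit xmlns="http://www.srcML.org/srcML/src" revision="1.0.0">
<unit revision="1.0.0" language="C++" filename="narq/main.cpp" item="1"><name>main</name></unit>
<unit revision="1.0.0" language="C++" filename="narq/tools.cpp" item="1"><name>rabinKarpMulti</name></unit>
<unit revision="1.0.0" language="C++" filename="narq/tools.cpp" item="2"><name>rhash</name></unit>
<unit revision="1.0.0" language="C++" filename="narq/tools.cpp" item="3"><name>partition</name></unit>
</unit>"""


def list_results(xml_text):
    """Return (filename, name) pairs, one per result unit."""
    root = ET.fromstring(xml_text)
    pairs = []
    # Each query result is a unit element directly under the root unit.
    for unit in root.findall("src:unit", NS):
        name = unit.find("src:name", NS)
        pairs.append((unit.get("filename"), name.text))
    return pairs


for filename, name in list_results(RESULT):
    print(f"{filename}: {name}")
```

Note that the element lookups in the sketch carry the src prefix, mapped to the namespace URI declared on the root unit.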

When constructing XPath queries, keep in mind that element names in the query must carry a namespace prefix; an unprefixed name such as function will not match any srcML element. In practice, this means using the built-in prefix src before most srcML elements, e.g., src:function, and cpp before any preprocessor elements, e.g., cpp:ifdef.
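
The effect of the prefix can be reproduced with any namespace-aware XML library. Here is a small sketch using Python’s standard xml.etree.ElementTree, which supports only a subset of XPath. The SAMPLE document is a hand-written, simplified stand-in for srcML, not actual srcml output, and the prefix-to-URI mapping is supplied by hand because ElementTree has no built-in src or cpp prefixes.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; ElementTree needs it passed in explicitly.
NS = {
    "src": "http://www.srcML.org/srcML/src",
    "cpp": "http://www.srcML.org/srcML/cpp",
}

# Hand-written, simplified srcML for illustration only (not real srcml output).
SAMPLE = (
    '<unit xmlns="http://www.srcML.org/srcML/src" '
    'xmlns:cpp="http://www.srcML.org/srcML/cpp" language="C++">'
    '<cpp:ifdef>#<cpp:directive>ifdef</cpp:directive> <name>DEBUG</name></cpp:ifdef>'
    '<function><type><name>int</name></type> <name>main</name>'
    '<parameter_list>()</parameter_list> <block>{}</block></function>'
    '</unit>'
)

root = ET.fromstring(SAMPLE)

# An unprefixed name only matches elements that are in no namespace at all.
unprefixed = root.findall(".//function")

# With the src prefix, the function element and its name child are found.
prefixed = root.findall(".//src:function/src:name", NS)

# Preprocessor elements live in the cpp namespace instead.
directives = root.findall(".//cpp:ifdef", NS)

print(len(unprefixed))              # 0
print([n.text for n in prefixed])   # ['main']
print(len(directives))              # 1
```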

Because the output is in the format of a srcML archive, any operation that can be performed on a srcML file can also be performed on the output, such as additional queries or transformations. However, not every XPath query returns elements. Depending on the query, the result could instead be a boolean value, a string, or a number.

Let’s perform a search that will return a string. We will search for the filename attribute on each unit in our srcML archive narq.xml, using the command:

$ srcml --xpath "string(//src:unit/@filename)" narq.xml

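Here, we are using XPath’s string() function to return the text of the filename attribute of each srcML unit in the archive. On its own, string() applied to several nodes yields only the value of the first one; we get one filename per unit because srcml evaluates the query against each unit in the archive. The result is a list of the source-code files that make up the archive: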

narq/main.cpp
narq/tools.cpp
narq/tools.hpp

Notice the results aren’t contained within the srcML format. This is because the result of the query is a string rather than XML.
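
The same distinction between element results and string results shows up when an archive is processed programmatically. As a rough sketch, assuming a hand-written stand-in for narq.xml with empty unit bodies (the real archive contains the full srcML of each file), selecting the units gives elements that can be queried further, while reading their filename attribute gives plain strings:

```python
import xml.etree.ElementTree as ET

NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written stand-in for the narq.xml archive; unit bodies are left empty.
ARCHIVE = (
    '<unit xmlns="http://www.srcML.org/srcML/src" revision="1.0.0">'
    '<unit revision="1.0.0" language="C++" filename="narq/main.cpp"/>'
    '<unit revision="1.0.0" language="C++" filename="narq/tools.cpp"/>'
    '<unit revision="1.0.0" language="C++" filename="narq/tools.hpp"/>'
    '</unit>'
)

root = ET.fromstring(ARCHIVE)

# Element results: each entry is still XML and can be searched again.
units = root.findall("src:unit", NS)

# String results: plain text, with no XML wrapper around it.
filenames = [unit.get("filename") for unit in units]

print("\n".join(filenames))
```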

Let’s do another search on narq. I’d like to make sure it’s well documented, so let’s write a query to see which function declarations don’t have Doxygen comments. This query will be a bit longer, so we’ll build it up step by step. First, let’s find the names of all the function declarations. We can do that with the command:

$ srcml --xpath '//src:function_decl/src:name' narq.xml

You should see a srcML archive containing the three function declarations: rabinKarpMulti, rhash, and partition. We don’t see the main function here because it has no separate function declaration. All of the resulting declarations come from tools.hpp.

Now let’s refine the query so that it finds the names of only those function declarations that have a Doxygen comment before them. We’ll use XPath’s preceding-sibling axis to look at the sibling nodes that come before the src:function_decl element. Let’s try:

$ srcml --xpath '//src:function_decl[preceding-sibling::src:*[1]/@format="doxygen"]/src:name' narq.xml

Take care to enclose the entire query in single quotes, since we’re using double quotes inside it to check that the format attribute matches “doxygen”. The predicate we’ve added checks that the nearest preceding sibling element of the function declaration (of any srcML element type) has a format attribute equal to “doxygen”. The result of the query is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src" revision="1.0.0">

<unit revision="1.0.0" language="C++" filename="narq/tools.hpp" item="1"><name>rabinKarpMulti</name></unit>

<unit revision="1.0.0" language="C++" filename="narq/tools.hpp" item="2"><name>partition</name></unit>

</unit>

So two functions, rabinKarpMulti and partition, have Doxygen comments. Cool! But we don’t want to know which functions are already documented; we want to know which ones still need documentation. The next step is to tweak the query so that a function’s name is returned only if it doesn’t have a Doxygen comment:

$ srcml --xpath '//src:function_decl[not(preceding-sibling::src:*[1]/@format="doxygen")]/src:name' narq.xml

Now we’ll get only the function names that we were missing from our previous results:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src" revision="1.0.0">

<unit revision="1.0.0" language="C++" filename="narq/tools.hpp" item="1"><name>rhash</name></unit>

</unit>

Now I know that all I have left to do to finish the documentation is to add a Doxygen comment to the rhash function in narq/tools.hpp.
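
To make the logic of the preceding-sibling predicate concrete, here is a hand-rolled sketch of the same check in Python. Python’s standard xml.etree.ElementTree does not support the preceding-sibling axis, so the sketch tracks the previous sibling itself. The SAMPLE document is a hypothetical, heavily simplified stand-in for the srcML of tools.hpp, and split_by_doxygen is an illustrative name, not part of srcML.

```python
import xml.etree.ElementTree as ET

SRC = "http://www.srcML.org/srcML/src"

# Hypothetical, heavily simplified srcML for a header like tools.hpp.
SAMPLE = (
    f'<unit xmlns="{SRC}" language="C++" filename="narq/tools.hpp">'
    '<comment type="block" format="doxygen">/** documented */</comment>\n'
    '<function_decl><type><name>int</name></type> <name>rabinKarpMulti</name>'
    '<parameter_list>()</parameter_list>;</function_decl>\n'
    '<function_decl><type><name>int</name></type> <name>rhash</name>'
    '<parameter_list>()</parameter_list>;</function_decl>\n'
    '<comment type="block" format="doxygen">/** documented */</comment>\n'
    '<function_decl><type><name>int</name></type> <name>partition</name>'
    '<parameter_list>()</parameter_list>;</function_decl>\n'
    '</unit>'
)


def split_by_doxygen(xml_text):
    """Return (documented, undocumented) lists of function_decl names."""
    root = ET.fromstring(xml_text)
    documented, undocumented = [], []
    # Visit every element as a potential parent, like the leading // does.
    for parent in root.iter():
        previous = None  # nearest preceding sibling element in the src namespace
        for child in parent:
            if child.tag == f"{{{SRC}}}function_decl":
                name = child.find(f"{{{SRC}}}name").text
                # Mirrors preceding-sibling::src:*[1]/@format="doxygen".
                has_doc = previous is not None and previous.get("format") == "doxygen"
                (documented if has_doc else undocumented).append(name)
            if child.tag.startswith(f"{{{SRC}}}"):
                previous = child
    return documented, undocumented


documented, undocumented = split_by_doxygen(SAMPLE)
print(documented)    # ['rabinKarpMulti', 'partition']
print(undocumented)  # ['rhash']
```

Only the immediately preceding sibling element is inspected, just as with the [1] in the query. A declaration separated from its Doxygen comment by any other element would therefore be reported as undocumented.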