This is part one of a two-part series that discusses working with XML data with an SQL Server backend. Part two is discussed here.

I’ve personally found XML files to be something of a bugbear. They’re going to require some kind of transformation to get it in a format that’s useful for importing or exporting. There are many, many methods to handle XML files but they can vary considerably and it can be confusing for someone new to handling XML. The information is also so scattered over the internet it’s not always easy to figure out what is really the best solution for your particular scenarios.

I’ll be addressing this need from two different places — inside Microsoft Access or inside SQL Server and SSIS. Generally, Access is useful when you need to interact with the files — that is, users need to be able to select the XML files directly or specify where the XML files should be saved to. SSIS is perfect when you want to totally automate the process and you don’t want to bother the user with dealing with loads and loads of XML files that are being sent in or sent out by an automated source.

Working with XML data efficiently 1 IT Impact Access Experts Ben Clothier Chicago

The biggest issue with XML handling in general, is that the number of steps to get from point A to point B can easily balloon if you are not careful. It’s tempting to want to load the XML file as it is into temporary tables, build queries to extract the relevant data and map it to the right columns than to actually insert it into production tables. There is nothing wrong with the process technically, but from a maintenance perspective, this is a nightmare. When you change the XML schema, you now have to update the temporary table’s structure, update the query to include the changes, and in some cases, update the production tables. That’s a lot! Same criticism can be applied to other methods such as using XSLTs to transform XML files into appropriate schema. Likewise, writing a routine to parse the XML’s content and extracting data are inherently going to be so highly coupled that any change in the XML schema could easily cascade into a maintenance nightmare.

A common approach I see is to just create a new text file and write XML out line by line, passing in the tag name along with the data. That’s one of worst approaches because now you’re hard-coding not only the data but also the schema! I can understand why some might want to do it – it is easy and doesn’t require becoming familiar with even more technologies. We’ve already had enough technologies and they’re all going obsolete so fast that learning a new tool is intrinsically a risky venture.

That said, I’ve come to the realization that there is a better way. Ultimately, the objective is to have most direct translation from SQL to XML and vice versa. Yes, we could use something like XSLT to do the transformation but we’d now be maintaining an XSLT stylesheet that would be obsolete when the XML schema changes. Likewise, I don’t like to write VBA or C# code to handle parsing of an XML document. No matter how you handle it, you are going to have to embed schema into the coding and that’s now one more step to maintain.

Now, when using SQL Server 2005 or newer, XML handling is built-in — you can generate an XML document using FOR XML PATH and you can load an XML document using OPENXML. As a result, you have a simple way to directly map SQL data to XML elements. Let’s consider how you can build an XML document. We’ll use this sample XML for the discussion:

<customers>
  <customer>
    <id>1</id>
    <name>John Smith</name>
    <status active="true" />
    <orders_stats>
      <count>3</count>
      <total currency="USD">100.00</total>
    </order_stats>
  </customer>
  <customer>
    <id>2</id>
    <name>Taylor Edwards</name>
    <status active="false" />
    <orders_stats>
      <count>7</count>
      <total currency="USD">380.00</total>
    </order_stats>
  </customer>
</customers>

Creating an XML document

As you can see, we have variations — we have nested elements (order_stats) and some data are attributes instead of values for some elements like status and total. Fortunately, those can be expressed in T-SQL. Let’s see how you can build it with T-SQL, assuming that we already have a view that provides all of the data we need to populate the XML document:

SELECT
  c.ID AS [id],
  c.FullName AS [name],
  (
    SELECT CASE WHEN c.Status = 0
           THEN N'false'
           ELSE N'true'
      END AS [@active]
    FOR XML PATH(N'status'), Type
  ),
  (
    SELECT
      c.CountOfOrders AS [count],
      (
        SELECT
          N'USD' AS [@currency],
          c.TotalAmount
        FOR XML PATH(N'total'), Type
      )
    FOR XML PATH(N'order_stats'), Type
  )
FROM dbo.vwCustomers AS c
FOR XML PATH(N'Customer'), ROOT(N'Customers');

Using only SQL, we are able to construct a fairly complex XML document. Because this is all done directly in SQL, any schema change needs to be done in exactly one place, via the ALTER statement(s). The output is a single XML document so there is no additional handling we need to worry about. That’s a tremendous advantage to have.

Let’s go over some clauses within the SELECT list and see how they work to assemble the XML element.

1. Defining attributes and default values

As you may notice, we can input data to attributes instead of element. The key is prefixing the column name with the @, as we did for the currency:

N'USD' AS [@currency]

Similarly, a column that has no explicit name is presumed to be a part of the enclosing element:

SELECT c.TotalAmount
FOR XML PATH(N'total')

However, ensure that there is only one column with no name. If there are more columns without name, they will be put in the same value for the total which may not be valid. Generally speaking, when building an element that needs attributes and a single value, it’s best to branch it into its own subquery so you can specify the attributes and value precisely and ensure that it is closed accordingly.

2. Nesting the element

With the order_stats element, we have two elements contained within it to give the count and the total. Since that’s below the level of the customer node, we can go down a level by using a subquery of this general form:

..., (SELECT ... FOR XML PATH(N'order_stats), ...

We don’t need the FROM clause for the subquery because we’re still selecting from the same view referenced in the outermost FROM clause. Now presume we are not supposed to include the order_stats for a brand new customer who has not made any orders yet. For such customer where the count and total are NULL, the SQL as shown would generate this fragment:

<order_stats>
  <total currency="USD" />
</order_stats>

Not pretty! Fortunately, by adding a WHERE clause, we can prevent the inclusion altogether:

SELECT
  c.CountOfOrders AS [count],
  (
    SELECT
      N'USD' AS [@currency],
      c.TotalAmount
    FOR XML PATH(N'total'), Type
  )
WHERE c.Count IS NOT NULL 
  AND c.TotalAmount IS NOT NULL
FOR XML PATH(N'order_stats'), Type

With the WHERE clause in the subquery, the output changes to NULL which ensures that no order_stats element will be generated for this customer.

Also note the use of Type after the XML PATH. If this is not included, the elements get written with escaping like this:

<customer>
  <order_stats>
    <total currency="USD" />
  </order_stats>
</customer>

Type signals that we’re building a fragment of XML and it should be embedded in bigger XML fragment as XML, and not as text.

3. Defining the root element

By default, FOR XML PATH generates an element (e.g. the ‘customer’ element) for each row. We can wrap all customer elements with a single ‘customers’ (plural) element. By using ROOT clause after the FOR XML PATH, we can do just that.

4. Including & excluding optional elements

What is not apparent is if a field has a NULL value, it won’t show up as an element. This is useful if you have to ensure that only certain elements are present. Suppose that you have two elements, <PersonalNumber> and <OrganizationNumber> — you must use only one from either, and it would be an error to have both elements. You can control the appearance with CASE statements as shown:

...
  CASE WHEN p.IsPerson = 0 THEN NULL ELSE p.ID END AS PersonalNumber,
  CASE WHEN p.IsPerson = 0 THEN p.ID ELSE NULL END AS OrganizationNumber,
...

This then enables you to build a valid XML document that uses only one of two elements per p.ID.

5. Including empty elements

Sometimes you have to put in empty elements — for example, <IsActive />. This can be achieved in a similar manner to what we saw with including & excluding elements but instead of using NULL, we pass in a zero-length string:

CASE WHEN p.IsActive THEN N'' ELSE NULL END AS IsActive

There are of course, more tricks but the material above should enable you to generate a wide variety of XML format already and doing so all in Transact-SQL.

What about reading it in?

Now, suppose that the sample XML is to be an input XML instead? How would we read it into T-SQL? Fortunately, we can still do direct XML to SQL just as easily. Here’s the SQL syntax required to re-create the structure that vwCustomers view originally had:

DECLARE @doc int, @xml;

--Assume we populate @xml somehow

EXEC sp_xml_preparedocument @doc OUTPUT, @xml;

SELECT *
FROM OPENXML(@doc, N'//customer')
WITH (
  CustomerID int N'id',
  FullName varchar(255) N'name',
  Active bit N'./status/@active',
  OrderCount int N'./order_stats/count',
  TotalAmount money N'./order_stats/total'
);

EXEC sp_xml_removedocument @doc;

With OPENXML, it is fairly straightforward — we basically select an element on which to base most of our work — the customer element. When we need to retrieve data from another node that’s nested deeper than the customer node, we can do so using XPATH. You can see that we were able to get count and total by navigating './order_stats/count' and we can also read attributes with the @ prefix.

A significant caveat: the XML elements are case-sensitive and you won’t get an error. This makes sense when you think about it. Recall how we talked about some customers that ought not to have any order_stats elements. This is still a valid XML document even though the order_stats element does not exist. In this case, SQL Server fills in NULL for OrderCount & TotalAmount fields without throwing up any errors. By the same token, if we mistype the element as NAME when it’s actually name, we’ll get NULL value for that field even though there were actually data in the name element. Therefore, take care when you are mapping the elements to fields.

But where did XML data come from?

So far, we’ve discussed how SQL Server can manipulate XML data but we haven’t really talked about how we’d get the XML data, especially when it comes from file(s) somewhere else. This will be covered in part two.