This is part two of our two-part series on working with XML data with a SQL Server backend. Part one can be found here.
In part one, we showed how much simpler and effective it was to manipulate XML data by doing it directly in the SQL Server layer as opposed to front-end or middle-layer code. The inherent problem with manipulating XML is that we have to deal with both data and schema and we really don’t want to embed schema into the code or build components to deal with schema and thus complicate the maintenance. We looked at how we can build and read complex XML documents from SQL. However, we now need to figure out how we can get XML data from a file or files into SQL Server. This is the part two.
How do I get it in or out of SQL Server?!?
Sometimes, we have to deal with XML files and we can’t exactly pass a file to the SQL Server. Furthermore, for security reasons, it can be impractical to tell SQL Server where to get to the file. For these reasons, it’s more likely that you’ll have to load or create the XML file via either Access or SSIS which can more easily interact with the filesystem.
We’ll talk about how we can handle the files either using Access as the front-end or using SSIS for automated workflow but let’s discuss the general process that will apply to both Access and SSIS in terms of handling the XML documents. As mentioned at the start, the ultimate objective is to write code that will not require any further maintenance if the XML schema changes. The only thing that Access and SSIS both should do is ultimately push the XML document into SQL Server and let it do all the parsing. Therefore, we want to treat the input XML document as a black box to the front-end and middleware. Neither SSIS nor Access has a native XML data type that is directly compatible and equivalent to the SQL Server’s XML data type. Therefore, to avoid the pipeline mangling of the XML document as it travels through, we will treat all input XML documents as varbinary(MAX)
and all XML documents returned by SQL Server as nvarchar(MAX)
.
This may seem a weird thing to do – why would we use two different data types for incoming and outgoing? The answer is that if we try to open an XML document as a text file, we may be applying wrong encoding to it and as a result, the pipeline will be sending in gibberish. By opening it as a binary file, neither SSIS nor Access makes any assumptions about encoding and leaves it totally alone, deferring the final decoding of the content to the SQL Server’s XML parser; which is exactly what we want. Within the C# scripts, we could use XML datatype which is exposed via the SqlClient
and when we are using an ADO.NET connection. However, in this case, flexibility is preferred so we are not using the XML datatype but rather a more generic datatype that is widely supported by various client components such as OLEDB or ODBC and not just SQL Server’s ADO.NET provider. For both Access and SSIS, we’ll use this SQL stored procedure to read input XML:
CREATE PROCEDURE dbo.uspReadXML( @binXML varbinary(MAX) ) AS BEGIN DECLARE @doc int, @xml xml = CAST(@binXML AS xml); EXEC sp_xml_preparedocument @doc OUTPUT, @xml; SELECT ... FROM OPENXML( ... ) EXEC sp_xml_removedocument @doc; END;
Writing out the output is not as simple because SQL Server will always output the XML without any processing instructions and always in a single encoding. This may actually be unacceptable since you may be required to encode the XML in something different and sometimes the ultimate consumer of the XML file you’re generating requires that you have a processing instruction at the head of the XML file. Therefore, we must deal with the encoding problem inside the VBA or C# code. This is not as bad as it seems since it’s still possible to parameterize the code responsible for encoding by passing in a parameter along with the XML document so that you still can control the output from the SQL Server by setting the parameter for the encoding accordingly.
As it happens, both string data types used by MS Access VBA and SSIS (and C#) are also the same encoding. Therefore, we can just cast the XML document into a string so that both VBA and C# are able to read the string intelligently, load it into a proper XML document and then in that layer, we can specify which encoding we want it to be and let that layer do its magic and save a properly encoded XML file without any knowledge or concern as to its contents beyond the encoding.
This is how we may construct a stored procedure for the purpose of writing out XML files:
CREATE PROCEDURE dbo.uspWriteXML AS BEGIN DECLARE @xml xml; -- Assume we populate @xml somehow... SELECT XMLFileName, XMLMessage, XMLEncoding FROM (VALUES (N'MyMessage.xml', CAST(@xml AS nvarchar(MAX)), N'utf-8') ) AS x(XMLFileName, XMLMessage); END;
In this illustration, we only have one row returned but the design should illustrate how you can easily generate multiple XML files from a single stored procedure and therefore simplify the client-side programming by looping over the resultset returned.
Using Access
To read XML files and send the content into a SQL stored procedure, we’ll need to use ADO and MSXML. For simplicity, the examples will use early-binding and you of course can revise the code to use late-binding. To read in XML files and send it to SQL Server, we’ll use the uspReadXML
procedure. To read the XML file in, we’ll use ADO’s Stream object and open it as a binary then read the file as supplied. Here’s the minimum VBA code to do this:
Public Sub ReadXML(XMLFilePath As String) Dim con As ADODB.Connection Dim cmd As ADODB.Command Dim stm As ADODB.Stream Set con = New ADODB.Connection Set cmd = New ADODB.Command Set stm = New ADODB.Stream With con .ConnectionString = "..." .Open End With With stm .Type = adTypeBinary .Open .LoadFromFile XMLFilePath End With With cmd Set .ActiveConnection = con .CommandType = adCmdStoredProc .CommandText = "dbo.uspReadXML" .Parameters.Append _ .CreateParameter("@binXML", adVarBinary, adParamInput, -1) .Parameters("@binXML").Value = stm.Read .Execute End With End Sub
If you want to generate XML files from a stored procedure, you can use uspWriteXML
stored procedure with an ADODB.Recordset
and an MSXML.DOMDocument
to do so. Assume that the user supplies a folder for where to save all of the XML files.
Public Sub WriteXMLFiles(XMLFolderPath As String) Dim con As ADODB.Connection Dim cmd As ADODB.Command Dim rst As ADODB.Recordset Dim xml As MSXML2.DOMDocument60 Dim pi As MSXML2.IXMLDOMProcessingInstruction Set con = New ADODB.Connection Set cmd = New ADODB.Command Set rst = New ADODB.Recordset Set xml = New MSXML2.DOMDocument60 With con .ConnectionString = "..." .Open End With With cmd Set .ActiveConnection = con .CommandType = adCmdStoredProc .CommandText = "dbo.uspWriteXML" .Execute End With rst.Open cmd Do Until rs.EOF xml.loadXML rst.Fields("XMLMessage") Set pi = xml.createProcessingInstruction("xml", _ "version=""1.0"" encoding=""" & rst.Fields("XMLEncoding") & """") xml.insertBefore pi, xml.firstChild xml.Save XMLFolderPath & rst.Fields("XMLFileName") rs.MoveNext Loop End Sub
Note how we don’t have to worry about the encoding since the MSXML library will take care of it for us as we supply the specific processing instructions. The uspWriteXML
procedure indicates which encoding it should be so if any encoding needs to be changed, it’s a matter of updating the uspWriteXML
procedure or even better, creating a table that uspWriteXML
procedure can read from. No need to touch the front-end code at all. Alternatively, the user might want to control the encoding, which is also doable with a small revision to the VBA code.
Using SSIS
The SSIS supports some XML manipulations but frankly, it leaves a lot to be desired. For example, if you try to use XML Source task, it would generate as many outputs as there are levels in an XML document. That leaves you with doing additional tasks to merge the various columns from different outputs, resulting in a lot of steps. It also allows you to use XSLT via an XML Task to transform XML so that the output can be managed but now you’re maintaining an XSLT and a SQL procedure to process the XML. Thanks, but no thanks. Then, the greatest sin: No XML Destination.
Fortunately, all is not lost. As shown above, we can manipulate XML directly in the T-SQL so the only thing an SSIS package needs to do is just push the bytes from XML file into T-SQL or write the bytes back into a new file. We’ll assume we will use stored procedures to handle the XML-SQL translation. We’ll be using the uspReadXML
stored procedure defined earlier in the article.
You may recall that the uspReadXML
stored procedure will expect a varbinary(MAX)
parameter. SSIS does not have an xml data type available to the pipeline (unless you want to use an object variable and script tasks to manipulate the variable). By sending the data in as a stream of bytes, we can use the statement CAST(@binXml AS xml)
to let SQL Server do all the hard work of parsing the right encoding and then convert the text into a common representation, all ready for safe consumption.
In the SSIS package, we use Script Task. Let’s look at the code required to read in the file and submit it to the stored procedure. We assume that we have configured the ReadOnly
variables and setup a ADO.NET connection manager accordingly.
public void Main() { SqlConnection conn = (SqlConnection)Dts.Connections["MyADONETConnection"] .AcquireConnection(null); SqlCommand comm = new SqlCommand(); comm.Connection = conn; comm.CommandType = CommandType.StoredProcedure; comm.CommandText = "dbo.uspReadXML"; comm.Parameters.Add("@binXML", SqlDbType.VarBinary); comm.Parameters["@binXML"].Value = File.ReadAllBytes((string)Dts.Variables["TempFileName"].Value); comm.ExecuteNonQuery(); Dts.Connections["MyADONETConnection"].ReleaseConnection(conn); Dts.TaskResult = (int)ScriptResults.Success; }
As you can see, the code requires no specific knowledge of the content of XML — this script then therefore, will work with any kind of XML input and if you choose to use variables for the CommandText
property and for the parameter(s), then it would also be able to select the right stored procedure dynamically at the runtime. Sweet!
Note also that if you don’t care for portability, you can use SqlDbType.Xml
instead of SqlDbType.VarBinary
, load it up using ReadAllText()
instead of ReadAllBytes()
, use xml
instead of varbinary(MAX)
as the stored procedure’s parameter and skip the CAST()
step within the stored procedure.
Now, what about writing out XML files? While there are several different ways to do this, I opted to build a table that contains 3 columns, one column representing the filename, another column representing the XML document, and the last one the encoding of the XML file. The XML is output as a nvarchar(MAX)
. Why? Well, the answer is that SQL Server handles XMLs using only one encoding — ucs-2 (aka utf-16). There is no way to change the encoding of XML inside the SQL Server, nor can you add processing instructions to the xml. Some applications may refuse to accept an XML file if they do not have the processing instructions. Others may insist that the xml be encoded in utf-8, not in utf-16. Therefore, the best way to manipulate the encoding and processing instructions is to do it inside a script. For that reason, we take in nvarchar(MAX)
as the input, then let .NET’s XML objects do all the magic. We will use the uspWriteXML
stored procedure.
Because the stored procedure returns a resultset, we’ll want to use SSIS Data Flow Task, execute the stored procedure as a source, then connect it to a Script Component within the Data Flow task. Here’s the code we would use in the Script Component, assuming we pass in a ReadOnly
variable WorkingPath
:
public override void Input0_ProcessInputRow(Input0Buffer Row) { string fileName = String.Concat( Variables.WorkingPath, Row.XMLFileName ); int len = (int)Row.XMLMessage.Length; XmlDocument xml = new XmlDocument(); XmlDeclaration dec = xml.CreateXmlDeclaration( "1.0", Row.XMLEncoding, null ); xml.LoadXml(Encoding.Unicode.GetString( Row.XMLMessage.GetBlobData(0, len) )); xml.InsertBefore(dec, xml.FirstChild); xml.Save(fileName); }
Even though the SQL Server stored procedure and C# string both handled the data as ucs-2/utf-16, by setting up the XmlDocument
and also the XmlDeclaration
, we were able to specify that encoding be changed to utf-8 and specify the processing instructions for the XML file before saving it. Beyond the processing instructions, the task has no specific knowledge of XML content and therefore, it works with any kind of XML file. This way, you can build many different XML files in one go.
By handling the XML parsing entirely at the SQL level, we’re able to build an SSIS package that won’t need to be edited should XML’s structure change. We can just edit the stored procedure and keep running the SSIS packages as before. A big win on the maintenance front.
Hopefully, this article helps provide some idea as to how to handle XML files efficiently and more importantly, with a minimum number of steps, helping to make your code much easier to maintain. Please share your experiences with us.
[…] Earlier this year Ben posted a 2 part blog posts on this topic and this will be an update/improved version of these resources. Here are the links: Part 1 & Part 2 […]