Source Detail View (Parsing XML) #citations #xml

This page was stimulated by a question Jerry Bryan posed in the discussion on the Submit Your Problem page about finding and viewing the value of the Page field for a Free Form template. That led to a series of SQLite queries culminating in this one that I thought warranted a page for further discussion and development. It reports Master Source Names and their citations showing the field names and values entered via RootsMagic 4’s Edit Source window in the green area for Source Detail.

The challenge to viewing these Source Details and other like fields is that they are stored in a single cell for each citation, in hexadecimally coded, XML-like text. In raw form, it looks like this, using SQLiteSpy:

SourceNameFields
Births: Scotland Birth & Baptisms; LDS index (ct NFS, o/l db)Blob

A BLOB is the type name for a field containing hexadecimal vales. One has to convert the UTF-8 code to text using the CAST function to convert this BLOB field, strangely called FIELDS, to a text field we’ll call SourceDetail:

SourceNameSourceDetail
Births: Scotland Birth & Baptisms; LDS index (ct NFS, o/l db)<?xml version=”1.0″ encoding=”UTF-8″?>
<Root><Fields><Field><Name>AccessType</Name><Value/></Field><Field><Name>AccessDate</Name><Value>4 March 2011</Value></Field><Field><Name>ItemOfInterest</Name><Value>Thomas Bowman entry, birth (25 Sep 1864), Dundee`|“|` Thomas Bowman bir. (1864) </Value></Field><Field><Name>SubmitData</Name><Value>C11282-1, Scotland-ODM, Source Film Number: 6035516</Value></Field></Fields></Root>

That’s better but a forest of XML tags obscures the values of interest. The following example code uses the SUBSTR function to cut off the top and tail and nested REPLACE functions to strip out the XML tags and replace with punctuation. To keep the example simple, we won’t go into bringing out the persons, couples, and facts that cite these sources.

-- Citations of all Sources with Source Names and Source Detail fieldnames and values.
SELECT SourceName,
 REPLACE(
  REPLACE(
   REPLACE(
    REPLACE(
     REPLACE(
      REPLACE(PageValue, '</Name>', ': '),
     '<Field><Name>', '; '),
    '</Value>',''),
   '<Value>',''),
  '<Value/>',''),
 '</Field>','')
 AS SourceDetail
FROM
 (
  SELECT S.Name COLLATE NOCASE AS SourceName, SUBSTR(CAST(C.FIELDS AS TEXT),68, LENGTH(CAST(C.FIELDS AS TEXT))-100) AS PageValue
  FROM CitationTable C LEFT JOIN SourceTable S USING (SourceID)
  )
ORDER BY SourceName
;

Here is some sample output:

SourceNameSourceDetail
Births: Scotland Birth & Baptisms; LDS index (ct NFS, o/l db)AccessType: ; AccessDate: 4 March 2011; ItemOfInterest: Thomas Bowman entry, birth (25 Sep 1864), Dundee`|“|` Thomas Bowman bir. (1864) ; SubmitData: C11282-1, Scotland-ODM, Source Film Number: 6035516

Under SourceDetail, you can see the field name (always a one word concatenation) followed by a colon, space, field value and semi-colon, as substituted by the REPLACE() commands. (RM uses “||” to separate a long form from an abbreviation – these were manually quoted above because WikiEditor interprets it in a table as the start of a new cell.)

The same principles could be applied to view the Master Source field names and values or anywhere else RM uses the XML-like format.

 


Playing around with the Replace characters, I came up with a format that displays rather nicely in both a text editor and in Excel:

Citations-SourceDetailExcel.png
SQLiteSpy adds the quotes around the revised SourceDetail result, probably because of the embedded Carriage Returns. If you copy from SQLiteSpy and paste directly to Excel, you lose the control codes. Instead I paste to PSPad (Notepad should also work), copy from there to Excel, and then invoke its Text Import Wizard to delimit on Tabs with ” as text qualifier. The Text Import Wizard is not available for a direct paste to Excel from SQLiteSpy.

Here’s the revised query:

-- Source Names and Source Detail fieldnames and values for Citations of all Sources, in a columnar style
SELECT SourceName||CAST (X'0D' AS TEXT) AS SourceName,
 REPLACE(
  REPLACE(
   REPLACE(
    REPLACE(
     REPLACE(
      REPLACE(PageValue, '</Name>', CAST (X'09' AS TEXT)),
     '<Field><Name>', CAST (X'09' AS TEXT)),
    '</Value>',''),
   '<Value>',''),
  '<Value/>',''),
 '</Field>',CAST (X'0D' AS TEXT))
 AS SourceDetail
FROM
 (
  SELECT S.Name COLLATE NOCASE AS SourceName, SUBSTR(CAST(C.FIELDS AS TEXT),55, LENGTH(CAST(C.FIELDS AS TEXT))-87) AS PageValue
  FROM CitationTable C LEFT JOIN SourceTable S USING (SourceID)
  )
ORDER BY SourceName
;

I tried this query with three other current versions of software:

  1. SQLite Developer 3.8.3.496: results visible on screen, not as nice as SQLiteSpy; can directly copy/paste to Excel (no added quotes) and invoke Text Import Wizard, thus saving the intermediate paste/copy needed by Spy but first SourceDetail field is preceded with two tabs instead of one as programmed causing it to start in third column while all subsequent ones start in the intended second column. Simply changing the SUBSTR parameters from 55 to 68 and -87 to -100 solves that problem. So I prefer Spy for direct viewing of results – selecting a cell of results shows the full formatted result in the bottom-most window; no such window exists in Developer – the cell expands to the full width but the data is displayed as one row. I prefer Developer for the quality and efficiency of its export to Excel and a text editor (no added quotes and steps).
  2. SQLite2009Pro 3.7.6.3: results in SourceDetail column not visible on screen and improperly exported to Excel.
  3. SQLiteStudio 2.0.12: results in SourceDetail column not visible on screen and improperly exported to Excel.
  4. SQLite Expert Personal 3.3.28.2158: best on-screen display of results; copy/paste to Excel allows invoking Text Import Wizard, required to preserve tab and carriage returns but one tab seems to be lost. Perhaps query can be tuned.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.