Text in blobs

#1 · 2023-05-07, 9:48 am

Quote from thejerrybryan on 2023-05-07, 9:48 am
I have been meaning to write and ask about this for a long time and never got around to it. I was inspired to go ahead by Tom's new script for fixing problems with newspapers.com citations from ancestry.com. In particular, Tom's new script needs to deal with the XML data in the Fields column in CitationTable and I think also in SourceTable. I have only glanced through the scripts so far since I have no need of them myself.
Because of my proclivity for splitting sources, I have a good deal of experience dealing with SourceTable.Fields and CitationTable.Fields - moving data between them and adjusting them in other ways. And recently I have been doing some cleanup work in my RM7 database in preparation for RM9 into my production RM database.
The Fields columns are blob columns which contain text. It has never made sense to me why the columns are blob. It seems to me that blob columns should be reserved for data that is truly binary such as thumbnails of media files and that sort of thing. But RM does use blob columns sometimes, and it seems to "just work". Except that I can see the difference in looking at the Fields column using SQLiteSpy. I don't know about other SQLite managers, but SQLiteSpy displays the Fields columns in hex unless I CAST them to text.
Like Tom's new script, I have done a lot of processing similar to the following.
UPDATE CitationTable
SET Fields = some_value
WHERE some_condition
I have discovered that after having done so, those rows I have changed have become textified. SQLiteSPY shows those rows as text while still showing the unmodified rows in hex. But RM still seems to "just work" with both the textified rows and the rows that are still pure "text stored in a blob".I can't even begin to picture what's going on behind the scenes in SQLite where some of the rows are textified and some are not. And I don't really know what I mean as textified except that SQLiteSpy shows them as text instead of as hex.
But nonetheless, I ran into a problem. Part of the cleanup of my RM7 database in preparation for conversion to RM9 involved running scripts that make heavy use of the GROUP BY clause. For the columns specified in the GROUP BY clause, SQLITE performs comparisons of those columns row by row looking for duplicates. When all the GROUP BY columns match between two or more rows, then those rows are grouped. l discovered that for the GROUP BY clause, the Fields columns in the textified rows and non-textified rows were not being matched even though the actual information in the rows was identical.
Upon further investigation, it turned out that the LENGTH of the Fields columns in the textified rows matched exactly with the number of characters in the Fields column, but the LENGTH of the Fields column in the non-textified fields was one or two byes longer than the number of characters in the Fields column. It appears to me that the one or two extra bytes are part of the UTF-8 encoding of the UNICODE characters in which the Fields column is represented for internal processing. And the GROUP BY clause appears to be doing a pure binary comparison of the Fields column even though some of the rows are textified and some are not. Therefore, the otherwise identical Fields columns will not match between textified and non-textified rows.
In any case, I fixed my problem by doing the following.
UPDATE SourceTable
SET Fields = CAST(Fields AS BLOB)
UPDATE CitationTable
SET Fields = CAST(Fields AS BLOB)
This de-textifies all the rows in SourceTable and CitationTable that have been textified and apparently does nothing at all to the rows that are not textified.
I suspect that Tom will add further insight and clarity to this issue. And I suspect that some of what I think is correct is probably wrong. But that's what I know so far. I'm not sure what I'm going to do moving forward when I need to update the Fields column in either SourceTable or CitationTable. Getting some of the rows textified didn't seem to matter in RM at all and didn't seem to matter in SQLiteSpy at all until I started running scripts that did a GROUP BY on the FIELDS column.
Two simple solutions come to mind. One is to write updates to the Fields columns as follows.
UPDATE CitationTable
SET Fields = CAST(some_value AS BLOB)
WHERE some_condition
The other solution would be the one I have already used to de-textify all the Fields columns all in one go, whether they all really need it or not.

I have been meaning to write and ask about this for a long time and never got around to it. I was inspired to go ahead by Tom's new script for fixing problems with newspapers.com citations from ancestry.com. In particular, Tom's new script needs to deal with the XML data in the Fields column in CitationTable and I think also in SourceTable. I have only glanced through the scripts so far since I have no need of them myself.

Because of my proclivity for splitting sources, I have a good deal of experience dealing with SourceTable.Fields and CitationTable.Fields - moving data between them and adjusting them in other ways. And recently I have been doing some cleanup work in my RM7 database in preparation for RM9 into my production RM database.

The Fields columns are blob columns which contain text. It has never made sense to me why the columns are blob. It seems to me that blob columns should be reserved for data that is truly binary such as thumbnails of media files and that sort of thing. But RM does use blob columns sometimes, and it seems to "just work". Except that I can see the difference in looking at the Fields column using SQLiteSpy. I don't know about other SQLite managers, but SQLiteSpy displays the Fields columns in hex unless I CAST them to text.

Like Tom's new script, I have done a lot of processing similar to the following.

UPDATE CitationTable
SET Fields = some_value
WHERE some_condition

I have discovered that after having done so, those rows I have changed have become textified. SQLiteSPY shows those rows as text while still showing the unmodified rows in hex. But RM still seems to "just work" with both the textified rows and the rows that are still pure "text stored in a blob".I can't even begin to picture what's going on behind the scenes in SQLite where some of the rows are textified and some are not. And I don't really know what I mean as textified except that SQLiteSpy shows them as text instead of as hex.

But nonetheless, I ran into a problem. Part of the cleanup of my RM7 database in preparation for conversion to RM9 involved running scripts that make heavy use of the GROUP BY clause. For the columns specified in the GROUP BY clause, SQLITE performs comparisons of those columns row by row looking for duplicates. When all the GROUP BY columns match between two or more rows, then those rows are grouped. l discovered that for the GROUP BY clause, the Fields columns in the textified rows and non-textified rows were not being matched even though the actual information in the rows was identical.

Upon further investigation, it turned out that the LENGTH of the Fields columns in the textified rows matched exactly with the number of characters in the Fields column, but the LENGTH of the Fields column in the non-textified fields was one or two byes longer than the number of characters in the Fields column. It appears to me that the one or two extra bytes are part of the UTF-8 encoding of the UNICODE characters in which the Fields column is represented for internal processing. And the GROUP BY clause appears to be doing a pure binary comparison of the Fields column even though some of the rows are textified and some are not. Therefore, the otherwise identical Fields columns will not match between textified and non-textified rows.

In any case, I fixed my problem by doing the following.

UPDATE SourceTable
SET Fields = CAST(Fields AS BLOB)
UPDATE CitationTable
SET Fields = CAST(Fields AS BLOB)

This de-textifies all the rows in SourceTable and CitationTable that have been textified and apparently does nothing at all to the rows that are not textified.

I suspect that Tom will add further insight and clarity to this issue. And I suspect that some of what I think is correct is probably wrong. But that's what I know so far. I'm not sure what I'm going to do moving forward when I need to update the Fields column in either SourceTable or CitationTable. Getting some of the rows textified didn't seem to matter in RM at all and didn't seem to matter in SQLiteSpy at all until I started running scripts that did a GROUP BY on the FIELDS column.

Two simple solutions come to mind. One is to write updates to the Fields columns as follows.

UPDATE CitationTable
SET Fields = CAST(some_value AS BLOB)
WHERE some_condition

The other solution would be the one I have already used to de-textify all the Fields columns all in one go, whether they all really need it or not.

#2 · 2023-05-07, 10:14 am

I've little to add to that excellent description, Jerry. I'd observed the blob-text variation from early days with no apparent detriment but have also been casting text back to blobs when I remember to do so. However, I've not had reason to Group on the fields in question. One caveat from my last post on lumping Newspapers.com sources is that a Fields blob containing multiple XML fields in SourceTable or CitationTable does not necessarily have those fields in a constant order. So despite the XML field values being identical between two rows, Group will keep them in separate groups.

Why did the order change? I can only speculate that it was due to a change in TreeShare at one end or the other. I vaguely recall something about the ordering of fields in the Source Template definition - maybe that was revised along the way.

#3 · 2023-05-07, 11:47 am

Regarding whether the blob columns are displayed as hex or text, there is another variable that may or may not be relevant.

RM v7 would add a BOM (byte order mark) at the start of each blob containing XML. Those 2 bytes are not text and maybe SQLspy doesn't know how to handle it. Ver 8 and 9 no longer add the BOM and it is removed when updated. RM 8 & 9 also leave out the XML header line (something like <?xml version="1.0" encoding="UTF-8" ?> )
These are big improvement on how RM handles XML.