Disadvantages of using COLLATE NOCASE statements?

#1 · 2025-02-09, 5:37 pm

Hello all.

I think this might be the first time I've posted here. First off, I'd like to thank Tom Holden, Richard Otter and everyone else here for so selflessly sharing their knowledge and time in this forum. Over the last 14 years since I started using RM, I've spent many hours reading the posts, documentation and using your scripts. You have been a huge help to me.

In my current environment (on an M3 Macbook), and using a sqlite client that doesn't support adding plugins, Unifuzz isn't an option. So I've been using COLLATE NOCASE statements in my SQL statements, and it seems to be working fine.

I just want to make sure I fully understand the ramifications. My understanding is by not using RMNOCASE or Unifuzz -- sorting of non-English UTF-8 characters will likely be less than ideal. With my genealogy, that is not a big issue. I just want to check if there are any other consequences I'm unaware of.

Thanks,

Michael

kevync has reacted to this post.

#2 · 2025-02-10, 9:55 am

Curious what others say about this.

#3 · 2025-02-10, 10:51 am

Welcome to the forum, Michael, and I'm gratified to hear you have quietly benefited from the website for all those years.

I haven't used COLLATE NOCASE to override RMNOCASE in over a decade, ever since coming up with the fake RMNOCASE workarounds. My vague recollection of the 'before' era was that I could not INSERT into a table with a column indexed using RMNOCASE nor UPDATE that column, i.e., the COLLATE NOCASE worked only with SELECT. It does now so either that was a subsequent development I've not noticed or my memory is faulty or I was mistaken at that early stage in learning.

What does not work with the COLLATE NOCASE override is REINDEX. Therefore, the SQLite Integrity check will always fail if you do not have available a RMNOCASE collation. That runs the risk of something being missed in Find results or disordered by a Sort as you have recognised. And any UPDATEs or INSERTs you do on RMNOCASEd tables will cause RM's Integrity Check to fail. RM's Rebuild Indexes will clear out the index errors reported.

The SQLite VACUUM command relies on REINDEX so it, too, will fail.

I remember one workaround I tried was to build a database structure with NOCASE instead of RMNOCASE and then copy everything from the RM database to the NOCASE shell. As far as I can remember, that database opened and operated fine in RM4.

#4 · 2025-02-10, 11:10 am

My use case is generating reports using only read-only queries. Sounds like I'm good.

Just curious as to the current state of fake RMNOCASE workarounds. I've never found a Unifuzz ported to MACOS. Is it still true you can't run a fake workaround natively on the Mac?

FYI, I do run SQLite Expert on Windows 11 on my M3 Mac via Parallels. The developer was even good enough to update Expert's install script to work via Parallels. My ARM processor confused the installer, even though Windows was able to run the program itself.

Michael

#5 · 2025-02-10, 12:47 pm

The RMNOCASE problem is one of the areas of using SQLite with RM where I am definitely not one of the experts you should rely on. Tom has much more experience and expertise dealing with RMNOCASE than I do. So I would offer only the following comments.

As long as you are only reporting and not updating, it's hard to see how your NOCASE override idea could ever have any adverse impact.

My environment is Windows rather than Mac, so I'm reluctant to generalize my Windows experience into the Mac world. But one thing I always do whether it seems necessary or not is that after doing an update of my RM database from outside of RM, I always do a REINDEX of my RM database from inside of RM. Doing so can do no harm, and it might catch a situation where I think the REINDEX is not necessary when in fact it actually is necessary. It's simply an abundance of caution.

#6 · 2025-02-10, 1:16 pm

Quote from Richard Otter on 2025-02-10, 1:16 pm
My thought is that doing read only queries with unifuzz64.dll providing the collation RMNOCASE without reindexing on this unifuzz64.dll provided collation could be a problem, but one that I've never seen.
It seems clear to me that if one just closes RM and then does a read only query involving RMNOCASE indexed columns against the database using the unifuzz64.dll provided collation, the query will be using the indexes already in the database created by RM using the real RMNOCASE. Since the collations are not identical, there could be issues.
I haven't done "explain plan" commands to be sure that the existing RMNOCAS collated indexes are actually used. And I don't know how a problem would show itself- a missing or erroneous result?, a malformed database error?
For my database, which includes many German umlaut characters, some Polish, and maybe a 100 Japanese Kanji characters, RM database tools check integrity gives a clean result, but when I close RM and then run pragma integrity_check from the command line using unifuzz64.dll, I get many missing index errors. (actually, 103 instances of (row NNN missing from index idxCitationName).
I'd be curious to hear what others find.
Regarding the use of NOCASE collation override, I'm committed to it for read only queries. I can be confident that they can run while the DB is open in RM. Only disadvantage is that a pre-existing index may not be available and queries may take longer. I haven't noticed however. All of my tables have fewer than a million rows.
For update queries, I never do then while RM has the file open if any of the modified data is RMNOCASE collated. And when done "offline", I always do a reindex RMNOCASE first.
I am also leaning toward creating a RMNOCASE -free schema, as Tom has described.
Oh by the way, since unifuzz64.dll isn't digitally signed, I'll provide the MD5 hash so we can all make sure we're using the same file.
MD5(unifuzz64.dll) =06a1f485b0fae62caa80850a8c7fd7c2

My thought is that doing read only queries with unifuzz64.dll providing the collation RMNOCASE without reindexing on this unifuzz64.dll provided collation could be a problem, but one that I've never seen.

It seems clear to me that if one just closes RM and then does a read only query involving RMNOCASE indexed columns against the database using the unifuzz64.dll provided collation, the query will be using the indexes already in the database created by RM using the real RMNOCASE. Since the collations are not identical, there could be issues.

I haven't done "explain plan" commands to be sure that the existing RMNOCAS collated indexes are actually used. And I don't know how a problem would show itself- a missing or erroneous result?, a malformed database error?

For my database, which includes many German umlaut characters, some Polish, and maybe a 100 Japanese Kanji characters, RM database tools check integrity gives a clean result, but when I close RM and then run pragma integrity_check from the command line using unifuzz64.dll, I get many missing index errors. (actually, 103 instances of (row NNN missing from index idxCitationName).
I'd be curious to hear what others find.

Regarding the use of NOCASE collation override, I'm committed to it for read only queries. I can be confident that they can run while the DB is open in RM. Only disadvantage is that a pre-existing index may not be available and queries may take longer. I haven't noticed however. All of my tables have fewer than a million rows.

For update queries, I never do then while RM has the file open if any of the modified data is RMNOCASE collated. And when done "offline", I always do a reindex RMNOCASE first.

I am also leaning toward creating a RMNOCASE -free schema, as Tom has described.

Oh by the way, since unifuzz64.dll isn't digitally signed, I'll provide the MD5 hash so we can all make sure we're using the same file.

MD5(unifuzz64.dll) =06a1f485b0fae62caa80850a8c7fd7c2