The following script was developed in response to the discussion Search for family by given names of children (was RM-8)
-- three_childrens_names.sql
-- Jerry Bryan, 22 May 2020
-- This query searches for parents based on given names of their children.
-- It gets a hit when three or more children have any primary or alternate
-- given names that are in a hardwired list. The report lists the parents.
SELECT COUNT(C.ChildID) AS fam_size,
F.FatherID, father.Given, father.Surname, F.MotherID, mother.Given, mother.Surname
FROM ChildTable AS C
JOIN
(SELECT DISTINCT NN.OwnerID
FROM NameTable AS NN
WHERE NN.Given LIKE("%Mary%") OR NN.Given LIKE ("%John%") OR NN.Given LIKE("%Elizabeth%"))
AS N ON N.OwnerID = C.ChildID
JOIN
FamilyTable AS F ON F.FamilyID = C.FamilyID
JOIN
NameTable AS father ON father.OwnerID = F.FatherID
JOIN
NameTable AS mother ON mother.OwnerID = F.MotherID
GROUP BY C.FamilyID
HAVING fam_size >= 3
ORDER BY father.Surname, father.Given, mother.Surname, mother.Given;
Jerry
Very efficient query with fast results on a 160,000 person database.
As written, it lists a couple with 3 or more children, any one of which has a given name containing one of the test names. To make it list only those couples having 3 or more children whose given names together contain all 3 of the test names, simply change the OR between the test names to AND.
I got an anomalous result in the large database (mostly French names). This couple with only one child (“Mary” as an Alternate Name) is listed with fam_size=4. Then I noticed that fam_size was only shown as 3 or 4 and reviewed other families and found that the number of children was far larger and that fam_size reflected the number of children for which there was a match to the test name. It appears there are two issues:
1. “fam_size” is a misnomer; maybe it should be “matches”
2. I don’t see why the single-child should have qty4 for either “fam_size” or “matches”
For #1, I have not explored closely to see if “matches” might be consistently more appropriate (ignoring the exception of #2).
For #2, RM shows no multiple linking of the single child whose primary name is “M Hortense” and Alt Name is “Mary H”. Nor are her parents double-linked.
Quite a mystery…
The mystery is revealed – Alt Names for the parents results in an over-count. This revision adding the IsPrimary test for the fetching of the parents’ names corrects it and “matches” is the more useful name for the COUNT().
…
JOIN
NameTable AS father ON father.OwnerID = F.FatherID AND father.IsPrimary
JOIN
NameTable AS mother ON mother.OwnerID = F.MotherID AND mother.IsPrimary
…
From my testing of this SQL the counting appears to be applied to all children of the family plus any alternate names – thus giving numerous false positives.
Here is my version which finds only families with all three children. The only issue might be the inclusion of alternate names for either parent – which might be useful – which be eliminated with Tom’s previous suggestion:
-- Create views of Children - one for each name
-- Mary
DROP VIEW IF EXISTS Child1View;
CREATE TEMP VIEW Child1View AS
SELECT C.ChildID, C.FamilyID
FROM ChildTable AS C
JOIN NameTable AS N
ON N.OwnerID = C.ChildID
AND N.Given LIKE("%Mary%");
-- John
DROP VIEW IF EXISTS Child2View;
CREATE TEMP VIEW Child2View AS
SELECT C.ChildID, C.FamilyID
FROM ChildTable AS C
JOIN NameTable AS N
ON N.OwnerID = C.ChildID
AND N.Given LIKE("%John%");
-- Elizabeth
DROP VIEW IF EXISTS Child3View;
CREATE TEMP VIEW Child3View AS
SELECT C.ChildID, C.FamilyID
FROM ChildTable AS C
JOIN NameTable AS N
ON N.OwnerID = C.ChildID
AND N.Given LIKE("%Elizabeth%");
-- Find families with children Mary, John & Elizabeth
SELECT fam_size, CC.FamilyID,
F.FatherID, father.Given, father.Surname, F.MotherID, mother.Given, mother.Surname
FROM (SELECT COUNT(C.ChildID) AS fam_size, C.FamilyID
FROM (SELECT C1.ChildID, C1.FamilyID
FROM Child1View C1
UNION ALL
SELECT C2.ChildID, C2.FamilyID
FROM Child2View C2
UNION ALL
SELECT C3.ChildID, C3.FamilyID
FROM Child3View C3) AS C
GROUP BY C.FamilyID
HAVING fam_size >= 3) AS CC
JOIN FamilyTable AS F
ON F.FamilyID = CC.FamilyID
JOIN NameTable AS father
ON father.OwnerID = F.FatherID
JOIN NameTable AS mother
ON mother.OwnerID = F.MotherID
ORDER BY father.Surname, father.Given, mother.Surname, mother.Given;
DavidF, I ran a comparison of your script against Jerry’s, both of which I modified to eliminate the parents’ Alt Names from any consideration because:
1. Jerry’s reported a false fam_size if parents have Alt Names
2. Yours may list the Alt Name instead of the Primary (it lists only one name per family)
The results had 8 in common; yours listed 2 more. On inspection of those, it appears that your query counts a person named “Mary Elizabeth” or “Elizabeth Mary” twice while Jerry’s counts it once. In that sense, Jerry’s fam_size is the count of children that have one or more matches to the defined names while yours is counting the number of matches to the defined names. I think the former is the objective. A simple modification of yours that makes it correspond to Jerry’s is to delete the “ALL” from the “UNION ALL” phrases; that is equivalent to his SELECT DISTINCT.
Yours took 79s for the 1st pass; 1.73s on the 2nd pass.
Jerry’s took 6s and 0.5s
That shows that SQLite did some serious optimisation in the first pass in both cases with 45:1 and 12:1 improvements respectively. Jerry’s is the more direct code giving it a speed advantage.
Thanks for contributing to the discussion. I can see how useful a search of this sort within a database could be. I’ve certainly used childrens’ or siblings’ names in the Ancestry.com search tool to advantage.
This was not an extremely large and complicated query that required a lot of struggling to develop. Nevertheless, I struggled about whether to include Alternate Names or not in the script. I finally decided that if someone in the database has a Primary name of Jane and an Alternate name of Elizabeth J. that we would want to include her if we were searching for Elizabeth.
But I almost wish had I just filtered out the Alternate Names from the get go. A beta version of this script (not posted here) mangled the handling of Alternate Names for the children. The 1.0 version of this script (the one posted here) mangled the handling of Alternate Names for the parents. I have been away from the computer for several hours, and fortunately Tom fixed my second mangle in my absence.
When I ran the script in my own database, it suggested several errors in my own database where the same child was possibly listed twice and possibly needed to be merged or at least investigated. These were cases where the same family had more than one child named Mary or John or Elizabeth. They showed up in the report as families for which fam_size > 3. This suggests a new script to detect these possible “duplicate” children in a more general manner. RM’s duplicate searching looks at data in addition to name and does not seem to consider family relationships at all.
I first wrote a pre-beta of the script using a UNION of searches for the individual names. I never posted it anywhere and I posted the shorter version instead. It initially seemed like the most logical approach. And it certainly helps to clarify searches when there is somebody in the database named Mary Elizabeth and you are searching either for Mary or for Elizabeth. It is often the case that form follows function, so it shouldn’t be surprising that DavidF’s version and my pre-beta version were very similar.
For many years, my SQLite scripts made heavy use of views. I have moved away from using views so much because they can result in queries that are much slower than queries not using views. These days, I tend to use sub-queries instead, and indeed I suspect I overuse sub-queries. But the sub-queries give me much of the same effect as views without the performance penalty.
As an example of the overuse of sub-queries consider the following very simple query that is written without sub-queries. It finds all the people in the database whose surname (primary or alternate) includes the character string ‘bryan’ and who are color coded in some color other than black.
SELECT P.PersonID, N.Given, N.Surname, P.Color
FROM PersonTable AS P
JOIN
NameTable AS N ON N.OwnerID = P.PersonID
WHERE N.Surname LIKE('%bryan%') AND P.Color != 0;
Here is the same query in the style of subqueries. What it does is move pieces of the WHERE clause inside each sub-query conceptually to filter each table before the tables are joined. It works just as well and runs just as fast to apply the WHERE filter outside the JOIN.
In a highly theoretical sense, applying the filter before the JOIN should run much faster than applying the filter after the JOIN, but all relational database systems have optimizers that more or less move the WHERE filter before the JOIN for you without changing your source code at all. Even so, I like the sub-query style and it gives me the same sense of ease of design as did using views.
SELECT P.PersonID, N.Given, N.Surname, P.Color
FROM (SELECT PP.PersonID, PP.Color
FROM PersonTable AS PP WHERE PP.Color != 0) AS P
JOIN
(SELECT NN.Given, NN.Surname, NN.OwnerID
FROM NameTable AS NN WHERE NN.Surname LIKE('%bryan%')) AS N
ON N.OwnerID = P.PersonID;
Jerry
When I started learning and using SQLite with RM, I went straight to sub-queries because I didn’t know about Views. Then when you started posting scripts that were beautifully structured and easier to follow and test in stages because of your use of Views, I began to do so, too. I’ve not been much bothered by the performance penalty – my own databases are small. But, once in a while, I will test something on a large database that someone has sent me to debug.
I’m not criticizing views, and they truly make development of complicated queries much easier than other approaches. Indeed, my main query that I run more than any other is built on views. It is my “Missing Stuff” query, nice formal name there. But it can take 10 or 15 seconds or more to run on my 60,000 person database. I suspect it’s because there are be many levels of views that use other views internally.
I like to fix one missing item – a missing source, a missing media file, a missing obituary, a missing whatever – and then immediately rerun my “Missing Stuff” query. My “Missing Stuff” query is highly customized to the way I work with RM. More patient people would fix several missing items before rerunning the same query.
But I have discovered that sub-queries run MUCH faster than views with SQLite and RM – sometimes 10 or 20 times faster or more. So I’m trying to teach myself to do development in the same step by step way I can use with views, except doing it with sub-queries instead. I will test the sub-query as a full standalone query in one SQLiteSpy window. And when it’s ok, I will then copy and paste it into another SQLiteSpy window as a sub-query. So far, extra levels of nesting of sub-queries haven’t seemed to slow things down in any noticeable way.
I don’t think I will ever have the same ease of using SQL that I have with more procedural programming environments. I’m an assembly language programmer at heart and even now I do a whole lot more C++ programming than SQL programming. I sometimes wish that SQL made it easier to break a complex query into multiple pieces with “function like” coding that you can do with procedural languages – like having a sub-query stored in a separate file that somehow or other is INCLUDE’d into the main SQL file. Having coded a query in a separate SQLiteSpy window and then copied and pasted it into another window as a sub-query, then it feels like I’m stuck with it there if I need to make any tweaks to the sub-query. Views are also not stored in a separate file, but they do seem to be much more separated from the rest of a query than are sub-queries.
Finally, there is one subtle point about SQL and sub-queries that I’m unsure about. I like to use aliases, e.g. SELECT S.* FROM SourceTable AS S. As I understand it, sub-queries can access aliases defined outside the sub-query but not vice versa. This makes me wonder about naming conflicts if the same alias is used both inside and outside a sub-query. Scope of names such as variable names is a huge issue in C++ programming and I have a deep understanding of the scope of names in C++. But I have never found a good reference about the scope of aliases in SQLite. So just to be safe I try to avoid any possible naming conflict of aliases. I think it makes my sub-queries look funny like having SELECT SS.* FROM SourceTable AS SS to avoid a naming conflict with the alias S outside the sub-query. My extra caution may not be necessary, and it certainly isn’t necessary in C++. But avoiding any possible alias naming conflicts gives me a comfort level with using sub-queries.
Although I have tinkered with the RM Database using SQLite over the years, I haven’t written any serious SQL for over 10 years. It’s amazing how it all comes back to you. I was happy to be able to contribute something to the discussion.
I have updated my query to include the suggested improvements.
Being new to this forum, is there any way to get notified of new posts?
Thanks David
That is or has been a weak point for this WordPress platform or my ability to configure it. There is a RSS feed support – look in the sidebar menu under Meta for:
Entries feed
Comments feed
How you get notified is up to the RSS reader you use.
These feeds cover the main area of the website but not the Forum page plugin. It has its own subscriber options for email notices and RSS feed.
I must confess that I didn’t realise that bot you and Jerry had added more comments until I visited the site and noticed them in the sidebar under recent comments.
I get notices of new messages on this forum, including the full text of new messages. I see them in my email, not via RSS unless it’s the case that the RSS feed is using email to get the messages to me. I can’t remember what I did to get the messages emailed to me. I tried unsuccessfully to figure it out again so I could post directions.
WordPress is a very premier piece of software that is heavily used throughout the Internet. But in my dealings with it I have always had trouble figuring out how to use it. The product seems to try to keep screens uncluttered. It seems to me to use a “just in time” approach to showing you options and not showing you the options until it thinks you need to see them. To me, that makes it very hard to use and navigate the product. I can never find the option or feature that I need as quickly as I might wish. But it is a very functionally rich piece of software.
I fear that I’m going to conclude that RM8 suffers from the same problem, even though I’m confident I’m going to like the basic flattening of the RM8 user interface as compared to the RM7 user interface.
Script saved to file and tested with RM8 and RM7 database.