Media Duplicates – Reports and Remedies

Problem

TreeShare update operations that add images from Ancestry to the RootsMagic database can result in successive downloads of the same image file saved to a revised name, each added to the Media Gallery and independently tagged to their respective subjects. I don’t think this is either necessary or desirable (maybe there is a good reason yet to be revealed by the RootsMagician!). Moreover, such duplication (or worse, multiplication) bloats usage of storage space and inflates the Media Gallery making it harder to manage, navigate, find things. The few respondents to my survey  reported 10-20% of their media files were duplicates while an exercise with a 10-person subset of my 420 person Ancestry Tree after many TreeShare ops trying to resolve differences was far higher. The discussion and scripts that follow help understand the magnitude of the bloat and offer surgical tools to reduce it. These tools reduced the 92 media in the Gallery to 49.

Background

The issue predates TreeShare and could arise from combining databases but it is likely that it had not been noticed as frequently or to the same degree. Earlier effort dating back to RM4 (see Media Repair Queries) concerned itself with duplicate file names in different folders, not with duplicate content within differently named files.

RootsMagic places all TreeShare media from Ancestry in one folder for a given database. It does not attempt to substitute an existing duplicate media file for the one to be or is being downloaded; rather, it modifies the name of the downloading duplicate file with a unique suffix. Consequently, the old script would not even detect these differently named files with identical content, only those having the same name in separate folders. Moreover, it could generate false positives for files of the same name with different content leading to the deletion of potentially valuable information.

Finding Duplicates

By comparing RootsMagic thumbnails

RootsMagic’s database has a property for each image file that can be used as a basis for finding duplicate image content – the thumbnail image it creates of the image file. This thumbnail is stored in the database in the Thumbnail field in the MultiMediaTable. However, because it is a tiny thumbnail, it is a crude approximation of the original and it may be possible that similar images, such as some pairs of Census pages or BMD Indexes could give rise to identical thumbnails. Experience will tell. It’s a very convenient parameter to exploit as seen in this query result from a tiny, test database.

Examining the database’s MultimediaTable shows the duplicate file’s revised name:
 In this case, we can confidently surmise from the common part of the filenames AND the identical thumbnails that these undoubtedly have duplicate content.

By using CCleaner

Other files may not have strongly similar filenames despite identical thumbnails and, given the risk of false positives from the latter, a further test is required. The free CCleaner and other utilities can resolve if they have the same or different content; we’ll continue here with CCleaner and its Tools > Duplicate Finder feature set to search the folders of interest for image files that match each other only by Size and by Content. Here is a screenshot of its results:
The highlighted names are of those in the database’s Media Gallery. The others are copies made using Windows File Explorer and are unused by (and therefore unrecorded in) the RM database. The script correctly reported the two matched files used by the database from their identical thumbnails.

CCleaner exports its results as a tab-delimited text file which can be imported into a spreadsheet:

CCleaner Duplicate Files report imported into Google Sheets; headers added by user.

Combining the data from the RootsMagic database with that from the CCleaner Duplicate Files report should provide what is needed to identify and replace duplicates in the Media Gallery and to delete the then unused duplicate files from computer storage. Combining may be done by importing the CCleaner data into the RM file, temporarily, or by exporting RM data to the spreadsheet. What follows is the first approach.

Migrate CCleaner Duplicate Search results to SQLite

With the CCleaner data in the spreadsheet, formulas can be applied to create a SQLite INSERT statement for each reported file. We will create an empty temporary table with a corresponding structure into which the generated statements will insert values from the spreadsheet. I’ve created a publicly accessible Google Sheet which anyone can copy to their own spreadsheet to work on or to use directly.

Scripts for finding Duplicates

Before running any of these scripts, you should use RM’s File > Database Tools:

  1. Test Database Integrity
  2. If #1 fails, Rebuild Indexes and repeat #1
  3. Delete Phantoms

And it is probably advisable to run the latest script from Delete Phantoms to clear out any detritus that may just confuse or exaggerate comparisons of Database Properties before and after the Merge Duplicates script below. Use your OS File Explorer/Manager to get the properties of the media folder(s) used by the database.

Thumbnails

Media-DupThumbnailsList.sql Make sure you open the RM Media Gallery and let thumbnail creation complete before running this script against your database.  It creates a temp table named zDupMediaTable which is displayed at the end of script execution to show you the duplicates found. The script makes no modifications to the database; this table will be used by the merge script.  Note that it ignores media files that are not image files or any file type that is not in the subject database’s Media Gallery. And beware that there is a risk of false positives because thumbnail images are of such low resolution. If you intend to proceed to merge duplicates, do not close the database from the SQLite manager as that will delete the temp table needed.

Files

Media-DupFilesList.sql You have to edit this script to include the INSERT commands you will copy from the spreadsheet before running it on the database. It creates an identically named temp zDupMediaTable to be used by the merge script and makes no changes to the database. Because Ccleaner does a bit-wise or byte-wise comparison, there is no chance of a false match between different images. However, it will not match files having exactly the same image data but a difference in just one character of metadata. If you intend to proceed to merge duplicates, do not close the database from the SQLite manager as that will delete the temp table needed.

Merging Duplicates

Media-MergeDuplicates.sql This is where the rubber hits the road and database changes are afoot. You have taken safety precautions? And your SQLite manager still has the temp zDupMediaTable in the database from either of the above scripts? This script won’t do anything without it. Fire away!

When you next run TreeShare after script execution, all the people whose media has been affected by the merge and is tagged to anything related to the person will be marked as Changed and all previous Changed people will first have been cleared. Nothing material should have been changed by the script but flagging them so helps you to confirm that is the case.

Cleanup

Repeat the initial steps of RM Database Tools and Delete Phantoms  to clean out new detritus and then RM Database Properties for comparison of Before and After.

Delete surplus duplicate media files

The last step in the script lists the commands for deleting duplicate files from the media folder(s) that are no longer used by the database. Copy and paste them into a text editor for review and to avoid their loss if you have to close the database in the SQLite manager (some RootsMagic operations such as Backup and Restore cannot proceed if the database is open in another application). If you are confident that any or all are unused by any other database, copy and paste the respective ERASE commands into the CMD shell. They will execute immediately without asking for confirmation; the script could be modified to require a confirmation of each ERASE. For a large set, break up the command set into chunks.

Repeat getting folder properties for the After results to compare with Before, i.e., number of media files and total size.

Miscellaneous

  1. How well this procedure will scale with the size of the database is unknown. Certainly, once some large numbers of INSERTs and ERASEs is reached, there will be issues with copying and pasting and perhaps the size of the script for the INSERTs. Nor can we predict the speed with which the scripts will execute.
  2. How TreeShare functions remains a mystery in part. Events for a couple, such as Marriage, are especially problematic because we are advised that Ancestry handles them as separate individual events. It seems advisable, if not necessary, to have on the Ancestry Tree identical sources for the Marriage event for each person to be assured that RM will create the couple’s Marriage event with all the sources. And updating via TreeShare sometimes is a tail-chasing exercise in which ostensibly identical citations or media on both sides are perceived as new for the other with the update one way provoking the offer of a new one the other.
  3. The flagging of Changed People is done by setting the Modified field to 1 for records in LinkAncestryTable whose LinkType is 0. The script also sets the Modified field to 1 for records of LinkType 4 (citations) and 11 (media) but these have no effect on anything apparent in RM 7590.
  4. This procedure does not provide ERASE commands for media files that are unique (not duplicates) and unused by the database. The procedure on this early page could be adapted to do so: Scrapbook Files Status

5 Replies to “Media Duplicates – Reports and Remedies”

  1. The following was posted to the RootsMagic Forums by jrueger:

    RM says 6525 files in the Media folder and CCleaner says 2,174 are duplicates. All of these would be from TreeShare since I haven’t done any other database work such as restores or extensive manual addition of media files.

    I have used your SQLite tools for media duplication but don’t understand two key sections:
    1. the flag for TreeShare is reset (as expected) and now shows “new” sources for every person affected – should I accept these new sources or do nothing. That is not clear to me from your notes.

    2. the exact steps to follow to delete the duplicates is not clear. The instructions as comments in the .sql file seem to be vague as to exactly what commands should be executed – perhaps that is because I don’t understand the SQLite commands that well. Something needs to be executed at the Windows command line but the commands in the sql file are not Windows or DOS commands – that is the part that is confusing to me. How exactly do I identify the files that can now be deleted?

    Thanks for your tools and help.

    Ans:
    1. My experience and description was that having set the flag on the citations had no effect on the Changed detection for Ancestry Sources so no action was needed. I will have to revisit that but you could do us a favour by processing a change and reporting what the effect is on both ends.

    2. The script Media-MergeDuplicates.sql has as its last statement a SELECT which generates the ERASE commands to be copied into the Windows CMD shell for execution. Each duplicate file has its own ERASE command. You can inspect the results in the SQLite manager that executed the script or copy them into a text editor for review and editing out those you might wish to preserve. You could probably bulk edit the (long) list of commands and convert them to MOVE.

    Sorry if my notes were not clear enough – I’m not a trained tech writer.

    Tom

  2. I reran the scripts again and everything was plain as day. I had no problem with my 2nd concern – the erase commands were right there in front of me. The problem I had the 1st time was probably the way I was using SQLite Expert.

    As for Treeshare here is my experience. If TreeShare shows the only change to be on the NAME field then almost always it is flagging sources that it thinks are not in RM but on inspection are indeed already in RM – and so I ignored all of those. However there were a fair number of cases where a change was flagged for the MARRIAGE field. There was a marriage in Ancestry with no corresponding marriage in RM. This was a bit strange because for many of these cases I know that there was a marriage in RM and it seemed to have been removed by running the SQL scripts. I did go through each one and added the marriage back to RM via TreeShare.

    And then I wrecked things royally by rerunning the scripts because CCleaner still showed a large number of duplicate files. However the scripts only pulled a small number needing to be dealt with. And like a dummy I reran the scripts until there were no duplicates reported. But what that did was remove all those marriages I had just added back to RM but reset the TreeShare flag every time until running TreeShare in fact reported no changes.

    So now I am laboriously going through every person via TreeShare and looking for a reported difference between Ancestry and RM. In the vast majority of cases there is no difference other than the reported sources for the NAME field as above. Where a difference is reported for other fields I either accept or investigate based on my knowledge of the person. Although this is quite time consuming for a database of over 5000 people it is actually a good exercise because it has revealed cases of either poor or erroneous sources or other data inconsistencies … i.e. it is actually a good quality control exercise.

  3. jrueger, these scripts cannot cause the problems you observe with the NAME sources and the disappearing MARRIAGE facts.

    The NAME sources on Ancestry cannot match in RM7.5 because RM does not support NAME sources. Likewise the GENERAL (or PERSONal) sources in RM have no match in the Ancestry Tree. The mismatch has existed since the release of TreeShare. The RootsMagician has promised to include support for NAME sources in RM8. I don’t know what the future is for GENERAL sources on either side.

    RootsMagic Support has frequently cited a difference between RM and AMT as the cause of missing MARRIAGE facts or Marriage facts with no spouse. AMT purportedly does not have support for what RM calls “family-type” facts, such as Marriage, Divorce, etc. and transmits them through the API used by TreeShare as what RM calls “individual-type” facts. RM has to invent its “family-type” Marriage event from two separate AMT “individual-type” Marriage events. AMT allows a Marriage event with no spouse identified and that can result in an individual in RM getting this Marriage event and the spouse individual in RM missing the Marriage event. The scripts do not modify or delete the Marriage events; it can only change a link to a duplicate media file (from a Marriage citation) to the t0-be-unduplicated media file.

    So I would suggest that the problems you observe predate the usage of the scripts. Both your Before-script database and your After-script database can be connected to TreeShare at the same time to help with your comparison.

    As to re-running scripts, maybe one pass fails to address all duplicates and leaves some in place. It is entirely possible that you have duplicate files that are not even used by the database. I intended that if the procedure starts with the CCleaner duplicates report and you have successfully executed ALL of the CMD ERASEs, subsequent CCleaner duplicates test of the same folder(s) should find none. That is NOT the case if the procedure is started using the duplicate thumbnails paradigm – the risks have been identified – because it can only work on those files that have thumbnails in the Media Gallery. Duplicate files that are not in the Media Gallery and those that are in but have not had a thumbnail generated in the subject RM database remain untouched and will be detected by CCleaner. Also, it is likely that the loss of the Marriage events you added is because you accidentally reverted to an earlier version of your database before adding them.

  4. Second thought – the Delete Phantoms script can affect events. I will have to check if it is doing something adverse to the Marriage events.

  5. Thanks for the updates. I’ll try to keep better notes and logs as I work with RM and TreeShare – and learn more about SQLite. That way I can better understand how things work and fine-tune my procedures.

Leave a Reply