TreeShare update operations that add images from Ancestry to the RootsMagic database can result in successive downloads of the same image file saved to a revised name, each added to the Media Gallery and independently tagged to their respective subjects. I don’t think this is either necessary or desirable (maybe there is a good reason yet to be revealed by the RootsMagician!). Moreover, such duplication (or worse, multiplication) bloats usage of storage space and inflates the Media Gallery making it harder to manage, navigate, find things. The few respondents to my survey reported 10-20% of their media files were duplicates while an exercise with a 10-person subset of my 420 person Ancestry Tree after many TreeShare ops trying to resolve differences was far higher. The discussion and scripts that follow help understand the magnitude of the bloat and offer surgical tools to reduce it. These tools reduced the 92 media in the Gallery to 49.
The issue predates TreeShare and could arise from combining databases but it is likely that it had not been noticed as frequently or to the same degree. Earlier effort dating back to RM4 (see Media Repair Queries) concerned itself with duplicate file names in different folders, not with duplicate content within differently named files.
RootsMagic places all TreeShare media from Ancestry in one folder for a given database. It does not attempt to substitute an existing duplicate media file for the one to be or is being downloaded; rather, it modifies the name of the downloading duplicate file with a unique suffix. Consequently, the old script would not even detect these differently named files with identical content, only those having the same name in separate folders. Moreover, it could generate false positives for files of the same name with different content leading to the deletion of potentially valuable information.
By comparing RootsMagic thumbnails
RootsMagic’s database has a property for each image file that can be used as a basis for finding duplicate image content – the thumbnail image it creates of the image file. This thumbnail is stored in the database in the Thumbnail field in the MultiMediaTable. However, because it is a tiny thumbnail, it is a crude approximation of the original and it may be possible that similar images, such as some pairs of Census pages or BMD Indexes could give rise to identical thumbnails. Experience will tell. It’s a very convenient parameter to exploit as seen in this query result from a tiny, test database.
Examining the database’s MultimediaTable shows the duplicate file’s revised name:
In this case, we can confidently surmise from the common part of the filenames AND the identical thumbnails that these undoubtedly have duplicate content.
By using CCleaner
Other files may not have strongly similar filenames despite identical thumbnails and, given the risk of false positives from the latter, a further test is required. The free CCleaner and other utilities can resolve if they have the same or different content; we’ll continue here with CCleaner and its Tools > Duplicate Finder feature set to search the folders of interest for image files that match each other only by Size and by Content. Here is a screenshot of its results:
The highlighted names are of those in the database’s Media Gallery. The others are copies made using Windows File Explorer and are unused by (and therefore unrecorded in) the RM database. The script correctly reported the two matched files used by the database from their identical thumbnails.
CCleaner exports its results as a tab-delimited text file which can be imported into a spreadsheet:
Combining the data from the RootsMagic database with that from the CCleaner Duplicate Files report should provide what is needed to identify and replace duplicates in the Media Gallery and to delete the then unused duplicate files from computer storage. Combining may be done by importing the CCleaner data into the RM file, temporarily, or by exporting RM data to the spreadsheet. What follows is the first approach.
Migrate CCleaner Duplicate Search results to SQLite
With the CCleaner data in the spreadsheet, formulas can be applied to create a SQLite INSERT statement for each reported file. We will create an empty temporary table with a corresponding structure into which the generated statements will insert values from the spreadsheet. I’ve created a publicly accessible Google Sheet which anyone can copy to their own spreadsheet to work on or to use directly.
Scripts for finding Duplicates
Before running any of these scripts, you should use RM’s File > Database Tools:
- Test Database Integrity
- If #1 fails, Rebuild Indexes and repeat #1
- Delete Phantoms
And it is probably advisable to run the latest script from Delete Phantoms to clear out any detritus that may just confuse or exaggerate comparisons of Database Properties before and after the Merge Duplicates script below. Use your OS File Explorer/Manager to get the properties of the media folder(s) used by the database.
Media-DupThumbnailsList.sql Make sure you open the RM Media Gallery and let thumbnail creation complete before running this script against your database. It creates a temp table named zDupMediaTable which is displayed at the end of script execution to show you the duplicates found. The script makes no modifications to the database; this table will be used by the merge script. Note that it ignores media files that are not image files or any file type that is not in the subject database’s Media Gallery. And beware that there is a risk of false positives because thumbnail images are of such low resolution. If you intend to proceed to merge duplicates, do not close the database from the SQLite manager as that will delete the temp table needed.
Media-DupFilesList.sql You have to edit this script to include the INSERT commands you will copy from the spreadsheet before running it on the database. It creates an identically named temp zDupMediaTable to be used by the merge script and makes no changes to the database. Because Ccleaner does a bit-wise or byte-wise comparison, there is no chance of a false match between different images. However, it will not match files having exactly the same image data but a difference in just one character of metadata. If you intend to proceed to merge duplicates, do not close the database from the SQLite manager as that will delete the temp table needed.
Media-MergeDuplicates.sql This is where the rubber hits the road and database changes are afoot. You have taken safety precautions? And your SQLite manager still has the temp zDupMediaTable in the database from either of the above scripts? This script won’t do anything without it. Fire away!
When you next run TreeShare after script execution, all the people whose media has been affected by the merge and is tagged to anything related to the person will be marked as Changed and all previous Changed people will first have been cleared. Nothing material should have been changed by the script but flagging them so helps you to confirm that is the case.
Repeat the initial steps of RM Database Tools and Delete Phantoms to clean out new detritus and then RM Database Properties for comparison of Before and After.
Delete surplus duplicate media files
The last step in the script lists the commands for deleting duplicate files from the media folder(s) that are no longer used by the database. Copy and paste them into a text editor for review and to avoid their loss if you have to close the database in the SQLite manager (some RootsMagic operations such as Backup and Restore cannot proceed if the database is open in another application). If you are confident that any or all are unused by any other database, copy and paste the respective ERASE commands into the CMD shell. They will execute immediately without asking for confirmation; the script could be modified to require a confirmation of each ERASE. For a large set, break up the command set into chunks.
Repeat getting folder properties for the After results to compare with Before, i.e., number of media files and total size.
- How well this procedure will scale with the size of the database is unknown. Certainly, once some large numbers of INSERTs and ERASEs is reached, there will be issues with copying and pasting and perhaps the size of the script for the INSERTs. Nor can we predict the speed with which the scripts will execute.
- How TreeShare functions remains a mystery in part. Events for a couple, such as Marriage, are especially problematic because we are advised that Ancestry handles them as separate individual events. It seems advisable, if not necessary, to have on the Ancestry Tree identical sources for the Marriage event for each person to be assured that RM will create the couple’s Marriage event with all the sources. And updating via TreeShare sometimes is a tail-chasing exercise in which ostensibly identical citations or media on both sides are perceived as new for the other with the update one way provoking the offer of a new one the other.
- The flagging of Changed People is done by setting the Modified field to 1 for records in LinkAncestryTable whose LinkType is 0. The script also sets the Modified field to 1 for records of LinkType 4 (citations) and 11 (media) but these have no effect on anything apparent in RM 7590.
- This procedure does not provide ERASE commands for media files that are unique (not duplicates) and unused by the database. The procedure on this early page could be adapted to do so: Scrapbook Files Status