Forum

Please or Register to create posts and topics.

Deleting multiple duplicate urls from the source comment note

Over the years I my family tree has been imported and exported from several different family tree programs (which I'm trying to avoid now). Anyways, this has resulted in the same url being repeated several times in every "Source Comment" note. Here is an example:

https://www.ancestry.com/discoveryui-content/view/144087815:2442
Text From Source: Name: Clarence Friend
Age: 29
Estimated Birth Year: abt 1911
Gender: Male
Race: White
Birthplace: West Virginia
Marital Status: Married
Relation to Head of House: Head
Home in 1940: Reno, Preston, West Virginia
Map of Home in 1940: Reno, Preston, West Virginia
Farm: Yes
Inferred Residence in 1935: Reno, Preston, West Virginia

Income Other Sources: Yes

https://www.ancestry.com/discoveryui-content/view/144087815:2442

https://www.ancestry.com/discoveryui-content/view/144087815:2442

I've checked dozens of them, and it is always only one url (different ones for each source) that is repeated x3 times.

Since I have more than 15000+ sources, it would be very helpful if  there was an sql statement I could run that would get rid of all but one duplicate in just the source comments note field.

🙂

Matthew

Uploaded files:
  • source_comment_section.PNG
  • repeated-urls.PNG
  • repeated_urls_source_comment_section.PNG

If these duplications consistently occur in Source Comments whose first line begins with "http", it should be easy to find them:

SELECT SourceID, Comment FROM SourceTable
WHERE Comment LIKE 'http%';

Then it would be a matter of finding the position in the Comment of the second such occurrence and extracting everything up to it to a temporary table or view for review.

Is that something you can figure out using the INSTR() and SUBSTR() functions?

I'm not really sure how to do that. As I was looking at them, it dawned on me that the links are not necessary. the http links are the same as the citation webtag (which of course is now also the source webtag).

So if it is easier, can the sql just delete all weblinks from the comments note?

If not, 2 of the 3 weblinks (as shown in the pic) are always after the plain text. Those two would probably be easiest to delete?  I need to find me a good sqlite resource online to learn the syntax better...

Matthew

 

Uploaded files:
  • Capture.PNG

What's the saying... "Give a man a fish, and you feed him for a day; show him how to catch fish, and you feed him for a lifetime." So it's empowering to learn how to code in sqlite and there is an abundance of learning material out there. This site is not a tutorial; it's mainly about applying sqlite to RM so it's great for a new sqlite user to grab and use a prepared script and for an experienced sqlite user to understand the RM database structure sufficiently to roll their own. Have at it!

Here's some code I worked up to address your question, using SQLiteSpy with its integrated REGEXP functions.

DROP VIEW IF EXISTS SourceComments
;
CREATE TEMP VIEW SourceComments AS
SELECT
SourceID
,Comments
,TRIM(REGEXP_REPLACE(Comments,'http.+',''),' '||CHAR(10)||CHAR(13))
AS NewComments
FROM SourceTable
WHERE Comments REGEXP '^http.+'
;

-- You should review the SourceComments table
-- before executing the next statement.

UPDATE SourceTable
SET Comments
= (SELECT NewComments FROM SourceComments SC
WHERE SourceTable.SourceID = SC.SourceID)
WHERE SourceID IN (SELECT SourceID FROM SourceComments)
;

I ran into a limitation in the REGEXP_REPLACE function. I wanted the match pattern to always have "http" at the beginning of the line and to ignore any URLs that start after position 1 in the line. Normally, the pattern '^http.+' would do so but this implementation regex only finds the match in the first line. So I had to drop the "^" start-of-line symbol. Hopefully, you have no URLs other than these that always start their line. I'm going to contact the developer of SQLiteSpy and see if he can do something about it.

This sql code worked perfect for deleting the multiple urls from the source comments field.

The only issue I had is I first tried to run it using SQLite Expert Pro version and it did not recognize the regexp_replace function so I used sqlitespy and it worked fine.

Thanks again,

Matthew