My Movies - Forum

Welcome Guest

Search | Active Topics | Members | Log In | Register

My Movies » Other Topics » Third Party Software » Cleaning Up Duplicates in Your Collection

Cleaning Up Duplicates in Your Collection

Options · View

Previous Topic · Next Topic

el guapo

Posted: Monday, February 11, 2013 12:13:20 AM

Groups: Member

Joined: 12/10/2011
Posts: 80

Rank:

(9177)

Ok, so I wasn't sure to which forum to post this to, but decided it probably fits best here...

If you - like me - have a huge collection which grew over the years, that you have entered into your MyMovies DB within a very short time, you probably have accumulated some duplicate titles. May it be that this new Special Edition of your favourite movie came out that you had to get even though you already had the bare-bones release, or that now a film for the first time was released in HD on Blu-ray...

How to get a handle on these duplicates? How to find them easily, so that the ones you don't want anymore can be moved to the "for sale" group? This might be very easy if you have bought all your discs in the same country. Then you can just sort them for their title and all the duplicates will appear nicely sorted.

But if you're like me and have bought your discs from all over the world, it's not that easy. Unfortunately, the MyMovies collection management does not allow me to sort for the original title of the IMDB id. BTW, this is a feature I would really love: the ability to sort or filter for just anything (hint, hint).

So what to do then? Luckily, the good people at Binnerup Consult have provided us with a read-only DB user which allows us to directly connect to the MyMovies Collection Management Database and snoop around. And this is a very powerful possibility for all kinds of things we want to find out about our collection.

Prerequisites: what do you need?
* MyMovies Collection Management on Windows (sorry, but I don't know about the Mac version, but it might be similar)
* Microsoft SQL Server 2012 Management Studio. This can be obtained for free from Microsoft, just google for it. It seems that you need to install the whole SQL Server along with this to get the Studio, which really sucks...

BTW, I tried at first to connect to the DB using the Eclipse Data Tools Platform, but I couldn't get it to work without constantly getting errors :(

So, where's my data?

Once you have the Management Studio installed and started you need to connect to your MyMovies DB. This should already be running in the background, so with the DB credentials provided in the Wiki you can just connect to the DB:

Quote:

Server name: [Server]\MYMOVIES
Authentication: SQL Server Authentication
Login: mymovies
Password: 7eBrABud

After that you can snoop around in the DB tables by using the tree view and looking into the "My Movies" node.
The main data entries for all the titles in your collection are in the "[dbo].[tblTitles]" table. What's of particular interest to us are the intId column (the title ID number), the nvcLocalTitle which contains the title for the local disc release, the nvcOriginalTitle with the original movie title and the nvcIMDB column which contains the IMDB ID that we want to use to find duplicates.

Let's try a first query:

Code:

SELECT [intId], [nvcLocalTitle], [nvcOriginalTitle], [nvcIMDB] FROM [My Movies].[dbo].[tblTitles]

This will give us a list like in the attached screenshot (SimplyQuery01).

But we want to find groupings of titles with the same IMDB ID. So let's try sorting for it:

Code:

SELECT [intId], [nvcLocalTitle], [nvcOriginalTitle], [nvcIMDB] FROM [My Movies].[dbo].[tblTitles] ORDER BY [nvcIMDB]

Now, there are a lot of entries at the top with no IMDB ID at all followed by the universal "tt0000000" entry for stuff that has no IMDB entry or is a box set container (OrderedQuery01, OrderedQuery02). The empty ones are a topic for another day. Right now we want to get rid of those in the list, so let's add a "WHERE" clause to the query which excludes those:

Code:

SELECT [intId], [nvcLocalTitle], [nvcOriginalTitle], [nvcIMDB] FROM [My Movies].[dbo].[tblTitles]
WHERE [nvcIMDB] != '' AND [nvcIMDB] != 'tt0000000'
ORDER BY [nvcIMDB]

The result (OrderedQueryWithWhere01) already looks better, but we still haven't accounted for the titles which we have already moved to the "for sale" group. I want to only see those which I have put in the "for keeps" group. Well, for this we need to look at yet another database table, because this is part of the personal data and is not stored in the main table for the title. After a bit of looking around we find the [tblTitlePersonal] table with a column [intGroup], which contains a code for the several groups. I found that 1 means "for keeps", so that's what we are going to restrict the output to. To do this we need to "join" the tables in the query like this:

Code:

SELECT t.[intId], t.[nvcLocalTitle], t.[nvcOriginalTitle], t.[nvcIMDB]
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
ORDER BY t.[nvcIMDB]

This give us a slightly shorter result list (OrderedQueryWithWhere02), but it's still far too cumbersome to go through this huge list where in fact most of the entries are unique concerning the IMDB ID anyway. After all, we're just interested in those where we have the same film more than once.

So how do we find those titles? The answer is the "GROUP BY" clause which allows us to aggregate data across several entries which share the same characteristics, like, in our case, the same IMDB ID. This clause allows us also to restrict the output to those results where there is more than one. For exmaple, this query:

Code:

SELECT t.[nvcIMDB], COUNT(t.[nvcIMDB])
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )
ORDER BY COUNT(t.[nvcIMDB]) DESC

will give us an account of all the entries and the number of occurences, excluding the ones which appear only once. And the list is sorted by the number of occurences (SortedDuplicates01).

But while this is nice, what is not so nice is that the titles of the movies are now missing. We can't just add them in the output by adding the title columns in the SELECT statement, because this would break the grouping: the titles might vary and therefore make us miss some duplicates we have.

However, I found two possibilities to get the titles back in (and I'm sure there are more), and which one you pick is entirely up to your personal taste. The first one goes about concatenating all the titles for each line of the grouped output in a single column. I foudn this solution on Stackoverflow, and it seems that it kind of misuses a construct of the SQL Server's SQL dialect for something it was not really intended for. So please don't ask me for specifics, but rather read about it here . This is the query:

Code:

SELECT t.[nvcIMDB],
STUFF((
SELECT ' --- ' + CAST([nvcOriginalTitle] AS VARCHAR(MAX))
FROM [dbo].[tblTitles] WHERE ([nvcIMDB] = t.[nvcIMDB])
FOR XML PATH(''),TYPE
/* Use .value to uncomment XML entities e.g. > < etc*/
).value('.','VARCHAR(MAX)')
,1,2,'') as NameValues
,COUNT (t.[nvcIMDB])
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )
ORDER BY COUNT(t.[nvcIMDB]) DESC

And the result looks like in screenshot (SortedDuplicates02). As you can see at the top you see also a few false positives, namely tv series where for every season box the same IMDB ID is entered.

The advantage is that you only get one line per IMDB ID which is there more than once in your collection with the titles of the entries next to it.

If you prefer to have one line per MyMovies entry and have all entries for the same IMDB ID in order you can also use a nesteq query like this:

Code:

SELECT [intId], [nvcIMDB], [nvcLocalTitle], [nvcOriginalTitle] FROM [My Movies].[dbo].[tblTitles]
WHERE ([nvcIMDB] IN
(SELECT t.[nvcIMDB]
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )))
ORDER BY [nvcIMDB], [intId]

This will give you a list like in (NestedQuery01).

The drawback of this method is that it's not as easy to sort descending for the IMDB IDs with the most entries, and also that you can't print out the count for the number of entries.

However The advantage is that you don't need to scan for the titles in the concatenated line and that you can easily add additional information to the output, like the type:

Code:

SELECT title.[intId], title.[nvcIMDB], title.[nvcLocalTitle], title.[nvcOriginalTitle], typ.[nvcName]
FROM [My Movies].[dbo].[tblTitles] title, [My Movies].[dbo].[tblTypes] typ
WHERE (title.[nvcIMDB] IN
(SELECT t.[nvcIMDB]
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )))
AND title.[intType] = typ.[intId]
ORDER BY title.[nvcIMDB], title.[intId]

I find the result (NestedQuery02) rather useful for my purposes, and maybe you do to.

As you can see: there are many possibilities to explore your collection in more specific ways than the Collection Management can provide, and maybe I could wet your appetite to try for yourself. Don't worry: you can't break anything! The DB user we have at our disposal has read-only access. So just start exploring!

Best regards,
Oliver

el guapo attached the following image(s):

mrwonker

Posted: Monday, February 11, 2013 7:24:20 AM

Groups: Member

Joined: 1/10/2010
Posts: 856

Rank:

(1047)

Interesting post however If I wanted to find duplicates I would just export the database as an .xml and open with excel from there it is easy to do many things including finding duplicates.

my advice is given to help, my advice might not always be the best solution but is given to the best of my knowledge, my advice is not always approved by MyMovies and often isn't, if my advice ever suggests changing in any way a title from what is contributed on the webservice please insure no invalid changes are contributed, please read the following guides Disc Title Contribution Guide, Contribution Introduction, TV Series Contribution Guide

el guapo

Posted: Monday, February 11, 2013 10:08:43 PM

Groups: Member

Joined: 12/10/2011
Posts: 80

Rank:

(9177)

mrwonker wrote:

Interesting post however If I wanted to find duplicates I would just export the database as an .xml and open with excel from there it is easy to do many things including finding duplicates.

I would love to hear more about that. I'm not an Excel power user (in fact, I only use it at the office and use a freeware pffice suite at home).

I would imagine that it's easy to get the title data into Excel, but I have no idea on how to easily find such duplicates...

One advantage of the direct access to the DB is that you don't need to go through the export/import steps again to rerun your query...

Best regards,
Oliver

el guapo

Posted: Wednesday, February 13, 2013 12:32:28 AM

Groups: Member

Joined: 12/10/2011
Posts: 80

Rank:

(9177)

Hi again,
I just realised that my statement contained an error: the Id I printed out is not the collection number you see in the collection management, but an internal ID which is not shown in the UI at all. Since the collection number is part of your personal data (because it's different for each user who has the same title in the collection), this needs to be joined from the table containing the personal data like this:

Code:

SELECT title.[intId], pers.[intCollectionNumber], title.[nvcIMDB], title.[nvcLocalTitle], title.[nvcOriginalTitle], typ.[nvcName]
FROM [My Movies].[dbo].[tblTitles] title
, [My Movies].[dbo].[tblTypes] typ
, [My Movies].[dbo].[tblTitlePersonal] pers
WHERE (title.[nvcIMDB] IN
(SELECT t.[nvcIMDB]
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )))
AND title.[intType] = typ.[intId]
AND title.[intId] = pers.[intTitle]
ORDER BY title.[nvcIMDB], title.[intId]

However, I found that some of the duplicates I found this way seem to be false positives, because for some entries I have two entries in the personal data table for the same collection number. I started an inquery over in the support forum for CM to find out whether this means that my DB is corrupted, or if this is normal and I have to check for something in the records to see if it is a valid one (although I don't see why an invalid one should be kept around...)

Best regards,
Oliver

Users browsing this topic

guest

Forum Jump

You cannot post new topics in this forum.
You cannot reply to topics in this forum.
You cannot delete your posts in this forum.
You cannot edit your posts in this forum.
You cannot create polls in this forum.
You cannot vote in polls in this forum.