Ok, so I wasn't sure to which forum to post this to, but decided it probably fits best here...
If you - like me - have a huge collection which grew over the years, that you have entered into your MyMovies DB within a very short time, you probably have accumulated some duplicate titles. May it be that this new Special Edition of your favourite movie came out that you had to get even though you already had the bare-bones release, or that now a film for the first time was released in HD on Blu-ray...
How to get a handle on these duplicates? How to find them easily, so that the ones you don't want anymore can be moved to the "for sale" group? This might be very easy if you have bought all your discs in the same country. Then you can just sort them for their title and all the duplicates will appear nicely sorted.
But if you're like me and have bought your discs from all over the world, it's not that easy. Unfortunately, the MyMovies collection management does not allow me to sort for the original title of the IMDB id. BTW, this is a feature I would really love: the ability to sort or filter for just anything (hint, hint).
So what to do then? Luckily, the good people at Binnerup Consult have provided us with a read-only DB user which allows us to directly connect to the MyMovies Collection Management Database and snoop around. And this is a very powerful possibility for all kinds of things we want to find out about our collection.
Prerequisites: what do you need?* MyMovies Collection Management on Windows (sorry, but I don't know about the Mac version, but it might be similar)
* Microsoft SQL Server 2012 Management Studio. This can be obtained for free from Microsoft, just google for it. It seems that you need to install the whole SQL Server along with this to get the Studio, which really sucks...
BTW, I tried at first to connect to the DB using the Eclipse Data Tools Platform, but I couldn't get it to work without constantly getting errors :(
So, where's my data?Once you have the Management Studio installed and started you need to connect to your MyMovies DB. This should already be running in the background, so with the
DB credentials provided in the Wiki you can just connect to the DB:
Quote:Server name: [Server]\MYMOVIES
Authentication: SQL Server Authentication
Login: mymovies
Password: 7eBrABud
After that you can snoop around in the DB tables by using the tree view and looking into the "My Movies" node.
The main data entries for all the titles in your collection are in the "[dbo].[tblTitles]" table. What's of particular interest to us are the intId column (the title ID number), the nvcLocalTitle which contains the title for the local disc release, the nvcOriginalTitle with the original movie title and the nvcIMDB column which contains the IMDB ID that we want to use to find duplicates.
Let's try a first query:
Code:SELECT [intId], [nvcLocalTitle], [nvcOriginalTitle], [nvcIMDB] FROM [My Movies].[dbo].[tblTitles]
This will give us a list like in the attached screenshot (SimplyQuery01).
But we want to find groupings of titles with the same IMDB ID. So let's try sorting for it:
Code:SELECT [intId], [nvcLocalTitle], [nvcOriginalTitle], [nvcIMDB] FROM [My Movies].[dbo].[tblTitles] ORDER BY [nvcIMDB]
Now, there are a lot of entries at the top with no IMDB ID at all followed by the universal "tt0000000" entry for stuff that has no IMDB entry or is a box set container (OrderedQuery01, OrderedQuery02). The empty ones are a topic for another day. Right now we want to get rid of those in the list, so let's add a "WHERE" clause to the query which excludes those:
Code:SELECT [intId], [nvcLocalTitle], [nvcOriginalTitle], [nvcIMDB] FROM [My Movies].[dbo].[tblTitles]
WHERE [nvcIMDB] != '' AND [nvcIMDB] != 'tt0000000'
ORDER BY [nvcIMDB]
The result (OrderedQueryWithWhere01) already looks better, but we still haven't accounted for the titles which we have already moved to the "for sale" group. I want to only see those which I have put in the "for keeps" group. Well, for this we need to look at yet another database table, because this is part of the personal data and is not stored in the main table for the title. After a bit of looking around we find the [tblTitlePersonal] table with a column [intGroup], which contains a code for the several groups. I found that 1 means "for keeps", so that's what we are going to restrict the output to. To do this we need to "join" the tables in the query like this:
Code:SELECT t.[intId], t.[nvcLocalTitle], t.[nvcOriginalTitle], t.[nvcIMDB]
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
ORDER BY t.[nvcIMDB]
This give us a slightly shorter result list (OrderedQueryWithWhere02), but it's still far too cumbersome to go through this huge list where in fact most of the entries are unique concerning the IMDB ID anyway. After all, we're just interested in those where we have the same film more than once.
So how do we find those titles? The answer is the "GROUP BY" clause which allows us to aggregate data across several entries which share the same characteristics, like, in our case, the same IMDB ID. This clause allows us also to restrict the output to those results where there is more than one. For exmaple, this query:
Code:SELECT t.[nvcIMDB], COUNT(t.[nvcIMDB])
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )
ORDER BY COUNT(t.[nvcIMDB]) DESC
will give us an account of all the entries and the number of occurences, excluding the ones which appear only once. And the list is sorted by the number of occurences (SortedDuplicates01).
But while this is nice, what is not so nice is that the titles of the movies are now missing. We can't just add them in the output by adding the title columns in the SELECT statement, because this would break the grouping: the titles might vary and therefore make us miss some duplicates we have.
However, I found two possibilities to get the titles back in (and I'm sure there are more), and which one you pick is entirely up to your personal taste. The first one goes about concatenating all the titles for each line of the grouped output in a single column. I foudn this solution on Stackoverflow, and it seems that it kind of misuses a construct of the SQL Server's SQL dialect for something it was not really intended for. So please don't ask me for specifics, but rather read about it
here . This is the query:
Code:SELECT t.[nvcIMDB],
STUFF((
SELECT ' --- ' + CAST([nvcOriginalTitle] AS VARCHAR(MAX))
FROM [dbo].[tblTitles] WHERE ([nvcIMDB] = t.[nvcIMDB])
FOR XML PATH(''),TYPE
/* Use .value to uncomment XML entities e.g. > < etc*/
).value('.','VARCHAR(MAX)')
,1,2,'') as NameValues
,COUNT (t.[nvcIMDB])
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )
ORDER BY COUNT(t.[nvcIMDB]) DESC
And the result looks like in screenshot (SortedDuplicates02). As you can see at the top you see also a few false positives, namely tv series where for every season box the same IMDB ID is entered.
The advantage is that you only get one line per IMDB ID which is there more than once in your collection with the titles of the entries next to it.
If you prefer to have one line per MyMovies entry and have all entries for the same IMDB ID in order you can also use a nesteq query like this:
Code:SELECT [intId], [nvcIMDB], [nvcLocalTitle], [nvcOriginalTitle] FROM [My Movies].[dbo].[tblTitles]
WHERE ([nvcIMDB] IN
(SELECT t.[nvcIMDB]
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )))
ORDER BY [nvcIMDB], [intId]
This will give you a list like in (NestedQuery01).
The drawback of this method is that it's not as easy to sort descending for the IMDB IDs with the most entries, and also that you can't print out the count for the number of entries.
However The advantage is that you don't need to scan for the titles in the concatenated line and that you can easily add additional information to the output, like the type:
Code:SELECT title.[intId], title.[nvcIMDB], title.[nvcLocalTitle], title.[nvcOriginalTitle], typ.[nvcName]
FROM [My Movies].[dbo].[tblTitles] title, [My Movies].[dbo].[tblTypes] typ
WHERE (title.[nvcIMDB] IN
(SELECT t.[nvcIMDB]
FROM [My Movies].[dbo].[tblTitles] t, [My Movies].[dbo].[tblTitlePersonal] p
WHERE t.[nvcIMDB] != '' AND t.[nvcIMDB] != 'tt0000000' AND t.[intId] = p.[intTitle] AND p.[intGroup] = 1
GROUP BY t.[nvcIMDB]
HAVING ( COUNT(t.[nvcIMDB])>1 )))
AND title.[intType] = typ.[intId]
ORDER BY title.[nvcIMDB], title.[intId]
I find the result (NestedQuery02) rather useful for my purposes, and maybe you do to.
As you can see: there are many possibilities to explore your collection in more specific ways than the Collection Management can provide, and maybe I could wet your appetite to try for yourself. Don't worry: you can't break anything! The DB user we have at our disposal has read-only access. So just start exploring!
Best regards,
Oliver
el guapo attached the following image(s):








