epi_sort.py: Filename comparison
As every Joe Six-pack would do, I usually write a lot of scripts to automate my tasks as much as I can. Most of them aren’t even worth mentioning, but I have been meaning to start posting some of them anyway. I’ve stumbled upon lots of jewels on the net that seemed worthless to their authors, so if anyone gets to use one of mine I’ll be happy. You never know.
The problem:
I have a directory full of unclassified media files. Some are duplicates, some aren’t, and each one follows a different naming convention.
I even try to classify them from time to time, so you can throw some directories into the pack. Sometimes, I even create two or three directories for the same group-series-category-whatever before I realize there is an existing one with a slightly different name. And frequently a lot of files remain unclassified, many of which could fit into one of the directories I mentioned.
Of course … whenever a new file arrives at my home server, it gets thrown into that very same directory, so Chaos keeps spreading, as it always does.
To clarify things, let’s look at an example:
drwxrwxrwx 1 user user      4096 2010-07-03 11:50 01_Battlestar_Enterprise
drwxrwxrwx 1 user user      4096 2010-07-03 11:42 02_Startrek_Galactica
drwxrwxrwx 1 user user      4096 2010-07-03 11:50 03_battlestar.enterprise-season.1
-rwxrwxrwx 1 user user 220393472 2010-07-03 02:49 battlestar.enterprise.s1e01.avi
-rwxrwxrwx 1 user user 221227008 2010-07-03 02:50 Battlestar_Enterprise_1_22.mp4
-rwxrwxrwx 1 user user 195393472 2010-07-03 02:49 startrek.galactica.4x15.[ripper_22].mkv
As you can imagine, sorting things out can get really tedious, and there is no automatic way of doing it that I know of.
I had some time this morning and got fed up with it. Every little piece of help is more than welcome, and here is where Python comes to the rescue.
The solution:
There are dozens of ways to do this, but I ended up coding a quick hack to help me sort things out.
It just compares the names of files and directories and estimates how similar they are. In my experience, pairs that score above 50% usually do belong together.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# (C) 2010, Taher Shihadeh
# Licensed: GPL v2
"""
The script works based only on names of files and directories in a
non-recursive manner.
It takes a path as parameter and tries to determine if the names of
the contents look alike.
It removes separator characters, numbers and file extensions prior to
the comparison.
"""
import os
import sys
from operator import itemgetter

FAST = False  # Set to True to skip file-to-file comparisons
SEP = '_-+~.·:;·()[]¡!¿?<>'


def main(path):
    names = os.listdir(path)
    results = []
    for i, x in enumerate(names):
        # Resolve entries against `path`, not the current working directory
        x_dir = os.path.isdir(os.path.join(path, x))
        # Only look at later entries, so each pair is compared exactly once
        for y in names[i + 1:]:
            y_dir = os.path.isdir(os.path.join(path, y))
            if FAST and not (x_dir or y_dir):
                continue
            result = {'A': (x, x_dir), 'B': (y, y_dir)}
            str1, str2 = x, y
            # Only files have an extension worth stripping
            if not x_dir:
                str1, _ = os.path.splitext(x)
            if not y_dir:
                str2, _ = os.path.splitext(y)
            result['factor'] = compare(str1, str2)
            results.append(result)
        print('%.2f%% done' % ((i + 1) * 100.0 / len(names)), file=sys.stderr)
    show(results)


def split(str1):
    """Replace every separator with a space and split into words"""
    trans = str.maketrans(SEP, ' ' * len(SEP))
    return str1.translate(trans).split()


def clean(lst):
    """Drop the words that are purely numeric"""
    return [x for x in lst if not x.isdigit()]


def compare(str1, str2):
    """Return similarity factor as percentage"""
    aux1 = set(clean(split(str1.lower())))
    aux2 = set(clean(split(str2.lower())))
    set_or = aux1 | aux2
    set_and = aux1 & aux2
    if not set_or:
        # Both names were nothing but numbers and separators
        return 0.0
    return len(set_and) * 100.0 / len(set_or)


def show(results):
    """Show most similar last"""
    for x in sorted(results, key=itemgetter('factor')):
        a, b = x['A'], x['B']
        # Put the file on the left and the directory on the right
        if not b[1] and a[1]:
            a, b = b, a
        print('%.2f \t %s \t --> %s' % (x['factor'], a[0], b[0]))


if __name__ == '__main__':
    try:
        path = sys.argv[1]
    except IndexError:
        path = os.getcwd()
    main(path)
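To get a feel for the numbers, here is a minimal sketch that applies the same idea to the example listing from the beginning of the post. The factor is simply the words two names share divided by all the words they have between them (the Jaccard index), as a percentage. It is written for Python 3, and the function names `tokens`, `similarity` and `best_directory` are made up for this illustration; they are not part of the script, and picking a single "best" directory per file is an assumption about how you might use the output.

```python
import os

SEP = '_-+~.·:;()[]¡!¿?<>'  # the separators the script treats as word breaks


def tokens(name, is_dir=False):
    """Lowercase the name, drop a file's extension, split on separators
    and discard the words that are purely numeric."""
    if not is_dir:
        name, _ = os.path.splitext(name)
    table = str.maketrans(SEP, ' ' * len(SEP))
    return {t for t in name.lower().translate(table).split() if not t.isdigit()}


def similarity(a, b):
    """Shared words over all words, as a percentage."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


def best_directory(filename, directories):
    """Hypothetical helper: return (directory, factor) for the closest
    directory name, or (None, 0.0) if nothing matches at all."""
    file_tokens = tokens(filename)
    best = (None, 0.0)
    for d in directories:
        factor = similarity(file_tokens, tokens(d, is_dir=True))
        if factor > best[1]:
            best = (d, factor)
    return best


dirs = ['01_Battlestar_Enterprise',
        '02_Startrek_Galactica',
        '03_battlestar.enterprise-season.1']
files = ['battlestar.enterprise.s1e01.avi',
         'Battlestar_Enterprise_1_22.mp4',
         'startrek.galactica.4x15.[ripper_22].mkv']

for f in files:
    d, factor = best_directory(f, dirs)
    print('%6.2f  %s --> %s' % (factor, f, d))
```

With that listing, the first two files end up paired with 01_Battlestar_Enterprise (factors 66.67 and 100.00) and the third one with 02_Startrek_Galactica (factor 50.00). Note that words such as "s1e01" or "4x15" are not purely numeric, so they survive the cleaning and drag the factor down a bit.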
I don’t think anyone is going to use this script, but what the hell. It’s a big Internet ;-)
Dude. Found this site by doing a Google search for ‘python string.translate’.
Left this site with a script that organized hundreds of my media files into a more recognizable order.
Would you mind if I play around with this script in wxPython? A GUI Windows binary version of this would be extremely useful to many.
Be my guest. I’d love to take a look at that!! :)
I’ve had an almost-finished improved version of this sitting on my hard drive for quite some time. I’ll test it this weekend and post it over here, just in case you want to take off from there instead.
I had a hard drive crash not long ago, and it has been a lifesaver to organize all the recovered media.