<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>UnixWars &#187; scripts</title>
	<atom:link href="http://unixwars.com/tag/scripts/feed/" rel="self" type="application/rss+xml" />
	<link>http://unixwars.com</link>
	<description>Taher Shihadeh's ragbag</description>
	<lastBuildDate>Mon, 26 Dec 2011 23:38:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>epi_sort.py: Filename comparison</title>
		<link>http://unixwars.com/2010/07/03/filename-comparator/</link>
		<comments>http://unixwars.com/2010/07/03/filename-comparator/#comments</comments>
		<pubDate>Sat, 03 Jul 2010 14:08:57 +0000</pubDate>
		<dc:creator>Taher Shihadeh</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[scripts]]></category>

		<guid isPermaLink="false">http://unixwars.com/?p=667</guid>
		<description><![CDATA[As every Joe Six-pack would do, I usually write a lot of scripts to automate my tasks as much as I can. Most of them aren&#8217;t even worth mentioning, but nevertheless I have been meaning to start posting some of those. I&#8217;ve stumbled upon lots of jewels on the net that seemed worthless to their [...]]]></description>
			<content:encoded><![CDATA[<p>As every Joe Six-pack would do, I usually write a lot of scripts to automate my tasks as much as I can. Most of them aren&#8217;t even worth mentioning, but nevertheless I have been meaning to start posting some of those. I&#8217;ve stumbled upon lots of jewels on the net that seemed worthless to their authors, so if anyone gets to use one of mine I&#8217;ll be happy. You never know.</p>
<h4>The problem:</h4>
<p>I have a directory full of unclassified media files. Some are duplicates, some aren&#8217;t, and each one follows a different naming convention.</p>
<p>I even try to classify them from time to time, so you can throw some directories into the pack. Sometimes, I even create two or three directories for the same group-series-category-whatever before I realize there is an existing one with a slightly different name. And frequently a lot of files remain unclassified, many of which could fit into one of the directories I mentioned.</p>
<p>Of course &#8230; whenever a new file arrives at my <a href="/tag/home-server/">home server</a>, it gets thrown into that very same directory, so Chaos keeps spreading, as it always does.</p>
<p>To clarify things, let&#8217;s look at an example:</p>
<pre>drwxrwxrwx 1 user user      4096 2010-07-03 11:50 01_Battlestar_Enterprise
drwxrwxrwx 1 user user      4096 2010-07-03 11:42 02_Startrek_Galactica
drwxrwxrwx 1 user user      4096 2010-07-03 11:50 03_battlestar.enterprise-season.1
-rwxrwxrwx 1 user user 220393472 2010-07-03 02:49 battlestar.enterprise.s1e01.avi
-rwxrwxrwx 1 user user 221227008 2010-07-03 02:50 Battlestar_Enterprise_1_22.mp4
-rwxrwxrwx 1 user user 195393472 2010-07-03 02:49 startrek.galactica.4x15.[ripper_22].mkv
</pre>
<p>As you can imagine, sorting things out can get really tedious, and there is no automatic way of doing it that I know of.</p>
<p>I had some time this morning and got fed up with it. Every little piece of help is more than welcome, and here is where Python comes to the rescue.</p>
<h4>The solution:</h4>
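<p>There are dozens of ways to do this, but I ended up coding a quick hack to help me sort things out.<br />
It just compares the names of files and directories and estimates how similar they are: each name is split into words, numbers and file extensions are thrown away, and the score is the share of distinct words that appear in both names. Anything scoring above 50% is usually a correct match.</p>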
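<p>To make the scoring concrete, here is a minimal sketch of the idea written for Python 3, separate from the script itself. The helper names words and similarity are made up for this illustration, and the sample names come from the listing above. Treat it as an approximation of what the script does, not as a replacement for it.</p>

```python
import os

# Same separator characters as the SEP constant in the script
SEP = '_-+~.·:;·()[]¡!¿?<>'

def words(name, is_dir=False):
    """Return the set of meaningful lower-case words in a file or directory name."""
    if not is_dir:
        # Only files have an extension worth dropping
        name, _ = os.path.splitext(name)
    # Turn every separator into a space, then split on whitespace
    table = str.maketrans(SEP, ' ' * len(SEP))
    tokens = name.lower().translate(table).split()
    # Pure numbers (episode counters, ordering prefixes) carry no meaning here
    return {token for token in tokens if not token.isdigit()}

def similarity(a, b):
    """Percentage of distinct words that appear in both sets."""
    union = a.union(b)
    if not union:
        return 0.0  # nothing left to compare
    return 100.0 * len(a.intersection(b)) / len(union)

directory = words('01_Battlestar_Enterprise', is_dir=True)
for filename in ('battlestar.enterprise.s1e01.avi',
                 'Battlestar_Enterprise_1_22.mp4',
                 'startrek.galactica.4x15.[ripper_22].mkv'):
    print('%6.2f  %s' % (similarity(words(filename), directory), filename))
```

<p>The first file keeps three words (battlestar, enterprise, s1e01) and shares two of them with the directory, so it scores roughly 66.67. The second one matches completely, and the third shares nothing with that directory.</p>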
<pre class="prettyprint">#!/usr/bin/env python
# -*- coding: utf-8 -*-

# (C) 2010, Taher Shihadeh
# Licensed: GPL v2

"""
The script works based only on names of files and directories in a
non-recursive manner.

It takes a path as parameter and tries to determine if the names of
the contents look alike.

It removes separator characters, numbers and file extensions prior to
the comparison.
"""

import os
import sys
import string
from operator import itemgetter

FAST = False # Change this to skip file-to-file comparisons
SEP  = '_-+~.·:;·()[]¡!¿?<>'

def main (path):
    lst1    = os.listdir (path)
    lst2    = list(lst1) # Real copy: removing from it must not disturb the outer loop
    len_lst = len(lst1)
    count   = 0.0
    results = []

    for x in lst1:
        for y in lst2:
            if x==y:
                continue
            # listdir returns bare names, so join them with path before testing
            x_dir = os.path.isdir(os.path.join(path, x))
            y_dir = os.path.isdir(os.path.join(path, y))

            if FAST and not (x_dir or y_dir):
                continue

            result = {'A': (x, x_dir), 'B': (y, y_dir)}

            str1, str2 = x, y
            if not x_dir:
                str1,_ = os.path.splitext (x)
            if not y_dir:
                str2,_ = os.path.splitext (y)

            result['factor'] = compare (str1,str2)
            results.append(result)

        lst2.remove(x) # x has met every remaining name, so each pair is scored once
        count += 1
        print >> sys.stderr, '%.2f%% done' %((count / len_lst)*100)

    show(results)

def split (str1):
    trans = string.maketrans(SEP, ' '*len(SEP))
    return str1.translate(trans).split()

def clean (lst):
    assert type(lst) == list
    return filter(lambda x: not x.isdigit(), lst)

def compare (str1, str2):
    """Return similarity factor as percentage"""
    aux1 = clean (split (str1.lower()))
    aux2 = clean (split (str2.lower()))

    set_or  = set(aux1) | set(aux2)
    set_and = set(aux1) &amp; set(aux2)

    if not set_or:
        # Both names were made only of separators and numbers
        return 0.0

    return (float(len(set_and)) / float(len(set_or)))*100

def show (results):
    """Show most similar last"""
    for x in sorted(results, key=itemgetter('factor')):
        a,b = x['A'],x['B']
        if not b[1] and a[1]:
            a,b = b,a
        print '%.2f \t %s \t --> %s' %(x['factor'], a[0], b[0])

if __name__=='__main__':
    try:
        path = sys.argv[1]
    except IndexError:
        path = os.getcwd()

    main (path)
</pre>
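<p>The nested loop in main exists only to score every pair of names exactly once. If you are on Python 3, itertools.combinations does that bookkeeping for you. What follows is a hedged sketch of that variation, not part of the script: rank_pairs and the toy scorer shared_letters are hypothetical names, and any function taking two names and returning a number could be plugged in instead.</p>

```python
from itertools import combinations

def rank_pairs(names, score):
    """Score every unordered pair of names once and return them most similar last."""
    scored = [(score(a, b), a, b) for a, b in combinations(names, 2)]
    # sorted() is stable, so pairs with equal scores keep their original order
    return sorted(scored, key=lambda item: item[0])

def shared_letters(a, b):
    """Toy scorer (hypothetical): number of distinct characters two names share."""
    return len(set(a.lower()).intersection(b.lower()))

names = ['alpha', 'beta', 'gamma', 'alphabet']
for factor, a, b in rank_pairs(names, shared_letters):
    print('%d \t %s \t --> %s' % (factor, a, b))
```

<p>With n names this visits n*(n-1)/2 pairs, which is the same amount of work the nested loop does once each name is removed after its turn.</p>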
<p>I don&#8217;t think anyone is going to use it, but what the hell. It&#8217;s a big Internet ;-)</p>
]]></content:encoded>
			<wfw:commentRss>http://unixwars.com/2010/07/03/filename-comparator/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

