Latest Publications

epi_sort.py: Filename comparison

Like every Joe Six-pack would, I write a lot of scripts to automate my tasks as much as I can. Most of them aren’t even worth mentioning, but I have been meaning to start posting some of them anyway. I’ve stumbled upon lots of gems on the net that seemed worthless to their authors, so if anyone gets to use one of mine I’ll be happy. You never know.

The problem:

I have a directory full of unclassified media files: some are duplicates, some aren’t, and each one follows a different naming convention.

I even try to classify them from time to time, so you can throw some directories into the pack. Sometimes, I even create two or three directories for the same group-series-category-whatever before I realize there is an existing one with a slightly different name. And frequently a lot of files remain unclassified, many of which could fit into one of the directories I mentioned.

Of course, whenever a new file arrives at my home server, it gets thrown into that very same directory, so Chaos keeps spreading, as it always does.

To clarify things, let’s look at an example:

drwxrwxrwx 1 user user      4096 2010-07-03 11:50 01_Battlestar_Enterprise
drwxrwxrwx 1 user user      4096 2010-07-03 11:42 02_Startrek_Galactica
drwxrwxrwx 1 user user      4096 2010-07-03 11:50 03_battlestar.enterprise-season.1
-rwxrwxrwx 1 user user 220393472 2010-07-03 02:49 battlestar.enterprise.s1e01.avi
-rwxrwxrwx 1 user user 221227008 2010-07-03 02:50 Battlestar_Enterprise_1_22.mp4
-rwxrwxrwx 1 user user 195393472 2010-07-03 02:49 startrek.galactica.4x15.[ripper_22].mkv

As you can imagine, sorting things out can get really tedious, and there is no automatic way of doing it that I know of.

I had some time this morning and got fed up with it. Every little piece of help is more than welcome, and here is where Python comes to the rescue.

The solution:

There are dozens of ways to do this, but I ended up coding a quick hack to help me sort things out.
It just compares the names of files and directories and estimates how similar they are: each name is broken into words, and the score is the percentage of words that two names share. In my experience, any pair scoring above 50% usually does belong together.
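To make the idea concrete, here is a minimal sketch of the comparison in Python 3. It is not the script itself: `words` and `similarity` are hypothetical helper names, the separator list is a simplified assumption, and the names are passed in with their file extensions already removed.

```python
import re

# Simplified separator set, assumed for illustration only;
# the script below uses a longer list.
SEPARATORS = r"[_\-+~.:;()\[\]!?<> ]+"

def words(name):
    """Lowercase a name, split it on separators and drop numeric tokens."""
    tokens = re.split(SEPARATORS, name.lower())
    return {t for t in tokens if t and not t.isdigit()}

def similarity(name_a, name_b):
    """Shared words divided by all distinct words, as a percentage."""
    a, b = words(name_a), words(name_b)
    union = a | b
    if not union:
        return 0.0  # nothing left to compare (names made only of numbers)
    return 100.0 * len(a & b) / len(union)

if __name__ == "__main__":
    # Both names reduce to {battlestar, enterprise}: a full match.
    print(similarity("Battlestar_Enterprise_1_22", "01_Battlestar_Enterprise"))
    # "s1e01" is not purely numeric, so it survives: 2 shared words out of 3.
    print(similarity("battlestar.enterprise.s1e01", "01_Battlestar_Enterprise"))
    # No words in common at all.
    print(similarity("battlestar.enterprise.s1e01", "02_Startrek_Galactica"))
```

Note how a token such as `s1e01` counts as a word because it is not made up only of digits, which is why the second pair scores two out of three rather than a full match.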

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# (C) 2010, Taher Shihadeh
# Licensed: GPL v2

"""
The script works based only on names of files and directories in a
non-recursive manner.

It takes a path as parameter and tries to determine if the names of
the contents look alike.

It removes separator characters, numbers and file extensions prior to
the comparison.
"""

import os
import sys
from operator import itemgetter

FAST = False # Set to True to skip file-to-file comparisons
SEP  = '_-+~.·:;()[]¡!¿?<>'

def main (path):
    names   = os.listdir (path)
    total   = len(names)
    results = []

    # Compare every pair of entries exactly once
    for i, x in enumerate(names):
        # isdir() needs the full path, not just the entry name
        x_dir = os.path.isdir(os.path.join(path, x))

        for y in names[i+1:]:
            y_dir = os.path.isdir(os.path.join(path, y))

            if FAST and not (x_dir or y_dir):
                continue

            result = {'A': (x, x_dir), 'B': (y, y_dir)}

            # Only files have an extension to strip
            str1, str2 = x, y
            if not x_dir:
                str1,_ = os.path.splitext (x)
            if not y_dir:
                str2,_ = os.path.splitext (y)

            result['factor'] = compare (str1,str2)
            results.append(result)

        print('%.2f%% done' %((i + 1) * 100.0 / total), file=sys.stderr)

    show(results)

def split (str1):
    """Replace separator characters with spaces and split into words"""
    trans = str.maketrans(SEP, ' '*len(SEP))
    return str1.translate(trans).split()

def clean (lst):
    """Drop purely numeric words (season and episode numbers, etc.)"""
    return [x for x in lst if not x.isdigit()]

def compare (str1, str2):
    """Return similarity factor as percentage"""
    aux1 = set(clean (split (str1.lower())))
    aux2 = set(clean (split (str2.lower())))

    set_or  = aux1 | aux2
    set_and = aux1 & aux2

    # Names made only of numbers and separators leave nothing to compare
    if not set_or:
        return 0.0

    return (float(len(set_and)) / len(set_or))*100

def show (results):
    """Show most similar last"""
    for x in sorted(results, key=itemgetter('factor')):
        a,b = x['A'],x['B']
        # Print file --> directory when the pair mixes both kinds
        if not b[1] and a[1]:
            a,b = b,a
        print('%.2f \t %s \t --> %s' %(x['factor'], a[0], b[0]))

if __name__=='__main__':
    try:
        path = sys.argv[1]
    except IndexError:
        path = os.getcwd()

    main (path)
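If you only care about where each loose file probably belongs, the list of results can be boiled down further. The following is a rough sketch rather than part of the script: `suggest_moves` is a hypothetical helper, the sample data is written by hand in the same shape as the script's `results` list, and the 50% cut-off is just the rule of thumb mentioned earlier.

```python
def suggest_moves(results, threshold=50.0):
    """Pick, for each file, the directory it matches best above `threshold`.

    `results` is a list of dicts shaped like the ones the script builds:
    {'A': (name, is_dir), 'B': (name, is_dir), 'factor': percentage}.
    """
    best = {}
    for r in results:
        (a_name, a_is_dir), (b_name, b_is_dir) = r['A'], r['B']
        if a_is_dir == b_is_dir:
            continue  # only file-to-directory pairs are interesting here
        file_name, dir_name = (b_name, a_name) if a_is_dir else (a_name, b_name)
        current = best.get(file_name)
        if r['factor'] > threshold and (current is None or r['factor'] > current[1]):
            best[file_name] = (dir_name, r['factor'])
    return best


# Hand-made sample using names from the listing above;
# the factors were worked out by hand, not produced by a real run.
sample = [
    {'A': ('battlestar.enterprise.s1e01.avi', False),
     'B': ('01_Battlestar_Enterprise', True), 'factor': 200.0 / 3},
    {'A': ('03_battlestar.enterprise-season.1', True),
     'B': ('battlestar.enterprise.s1e01.avi', False), 'factor': 50.0},
    {'A': ('02_Startrek_Galactica', True),
     'B': ('startrek.galactica.4x15.[ripper_22].mkv', False), 'factor': 50.0},
    {'A': ('01_Battlestar_Enterprise', True),
     'B': ('03_battlestar.enterprise-season.1', True), 'factor': 200.0 / 3},
]

if __name__ == "__main__":
    for file_name, (dir_name, factor) in suggest_moves(sample).items():
        print('%s --> %s (%.2f%%)' % (file_name, dir_name, factor))
```

With this sample, the Startrek file scores exactly 50% against its directory, so it is not "above 50%" and gets left for manual review. Lowering `threshold` a little would pick it up, at the price of more false positives.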

I don’t think anyone is going to use it, but what the hell. It’s a big Internet ;-)

Marketing budgets

This entry is by no means technical, but it shows perfectly the vast difference in budget available to two very well known companies: Google and Opera.

I just stumbled upon these two videos. One of them is almost three months old. Although it doesn’t prove much, it is quite spectacular, with its fancy high-speed camera shooting at 2700 frames per second.

Chrome versus Potato

The other one… well… it isn’t as spectacular as Chrome’s. Seriously, it isn’t. But… OMG! This is genius. It is exactly as scientific as the first one, just not as visually appealing. And tomorrow morning I’ll still be laughing.

Opera versus Potato (Parody)

Cherokee Summit 2010: Mission accomplished

We’ve been working in a frenzy since last week. Not that we usually don’t, but this was something more. The Cherokee Summit took place last weekend and, among other things, we released our latest and greatest Cherokee v1.0, we defined the roadmap for v2.0, we shared knowledge with some of the most impressive experts in High Availability I’ve ever met and, above all, we had the chance to meet face to face. Our Community is, without a doubt, stronger than ever. The summit has been a great success. We had people attending from all over the World, of all levels of expertise, and even of all ages. In this photo you can see Alvaro and the youngest attendee.

Everything was recorded, so we will upload the slides and videos of all our sessions really soon. For now, only the photo gallery is available. Take a look at the mugshots.

I’m really glad we could make this Summit happen. It surpassed all my expectations. By far. It was an unbelievable experience, and we had lots of fun. Take a look at our family photo. If you want to know which of the guys above is me, here’s a clue: “In brightest day…”.

I’m really looking forward to the next summit. Cherokee Summit 2010 was awesome. I’m sure the next one will be even better.

Countdown to Cherokee Summit 2010

Only one more week to go!
Let me remind you all about the first Cherokee Summit. It will be held next week in Madrid (7-8 May), and I’m really excited about it. We will release Cherokee 1.0, rub shoulders with many members of our community, and define the roadmap for Cherokee 2.0. I’ll be giving a tech talk along with Jonathan Hernandez, so you know when and where to find me.

I’m sure that meeting many of the developers of Cherokee in person will be the highlight for me.

If by any chance you’ll be in Madrid that weekend, don’t forget to register in time and join us.

It’s gonna be legen… wait for it… dary!

European Space Agency

Org-mode to the rescue

It’s been a while since I started using Org-mode. Like four months or so. When I discovered it I knew I would blog about it sooner or later, but I didn’t want to rush things.

Before writing about it, I wanted to give it a run to see if it could be of any help to a rather absentminded guy. I’m sure many long-time Emacs users out there are forgetful at times. I know I am. It seems to fit the profile somehow ;-)

Since I couldn’t rely too much on my memory for these things, I had to find a task management solution. That’s where Org-mode comes in.

If you are like me, maybe Org-mode can save the day. I seem to be able to organize my time a lot better since I started using it.

Org-mode is an Emacs mode for keeping notes, maintaining ToDo lists, and planning projects with a fast and effective plain-text system. It seems awfully spartan and simplistic at first, but it is nothing less than magnificent in features. Being a part of Emacs is also a plus for me, since Emacs is the first thing I install on any platform I happen to be working on. Besides the OS independence, not being tied at all to a particular application does get extra points. Formats may vary over time, but plain text files are here to stay.

These days I’m using it as an outliner, as a note-taking application, to manage my accounting and, most importantly, as a Getting Things Done (GTD) tool. I don’t quite yet use it for Web and PDF Authoring, but it never hurts to know I could if I wanted.

For now the deal is working pretty well for me. It is very flexible, has lots of other uses, and has a very rich and knowledgeable community, so I totally recommend you take a look at some of the links in this post. It will be worth your while.

It’s official: Cherokee Summit 2010 is on its way!

It is no secret that our Cherokee-Project Community has been growing steadily and relentlessly over the last couple of years. In fact, it has been doing so well that we’ve reached a point where holding a conference about the project actually makes a lot of sense. A lot of people have been asking about this, and after a lot of work we are ready to announce our first Summit, to be held on May 7th-8th.

You can read Alvaro’s announcement, or you can check out the Summit web-site.

Cherokee will be an important topic, but it won’t be the only one. These will be a couple of days fully dedicated to High Performance and Scalable Web topics, so there’s room for everyone to join in. We are committed to reaching the 1.0 milestone of Cherokee by then, so we will also have a party to celebrate it.

It’s going to be fun. I’ll be a speaker at the summit and I’m really looking forward to personally meeting many of the members of the project. Thanks to our sponsors we’ve managed to make the event completely free, so don’t forget to register while we still have free spots!

UPDATE: We’ve written a little brochure (~100KB) that can be used to let your colleagues know about the summit. Do not hesitate to send it to any coworker or friend who would be interested in attending a High Performance and Scalable Web event.

Cherokee screencast season kicks off

In a previous post I introduced our first Cherokee Project screencast. We were going to wait for a new and improved website before we made them public, but what the hell! Why wait? I’m sure the new Cherokee-Project Screencast Collection will come in handy for many of you.

From here I’d like to thank P.V. Anthony for his invaluable advice on audio production and my old friend Sara Genge for lending her voice to the project (and for her awesome fiction writing, but that is another story).

Creative Commons makes my life better

I must confess I’m amazed. In this day and age, there are still quite a few theoretically influential folks who are convinced that “CC is not even an option” nowadays. I’m not going to point fingers here, but I guess you’ll understand that shit happens if you live in Spain, like I do.

After all, this is the country where the Government has just spent almost 750K€ as a covert gift to SGAE, our equivalent to RIAA, also known as ladrones (thieves). Beware, no copy-left music there. It is shameful in so many ways that I’d better not get started.

Saying Creative Commons is not an option is outrageous. Not just because I’ve always been a FLOSS advocate and CC simply fits my mindset. I believe these kinds of options simply make the World a better place.

Take this as an example. My friend Álvaro had a CC song playing today. It is called Code Monkey. Not only did I love it, being a geek and all. Knowing it was CC, I googled the author, and it turns out Jonathan Coulton releases his work under CC. He used to be one of us (and forever will be), but he switched fields from IT to music, and he seems to be doing pretty well. Kudos to you, sir! I love your work.

I’m pretty sure he would have had a hard time trying to live from his art through mainstream media (yes, Mu$ic Indu$try, I’m talking about you).

Not only is he succeeding, he has also made my day a lot more fun. I also found out his work has been used in award-winning works, which is something the license permits. This music clip won several Anime contests back in 2007. I’m not saying it is like winning a Nobel prize. But Madonna is not going to win one either. And quite frankly, seeing that Henry Kissinger once won the Nobel Peace Prize, that shouldn’t even be considered a dignifying example.

Check out the video clip. I for one had a warm fuzzy feeling listening to Jonathan Coulton’s work. You can buy all his pieces at really, really inexpensive prices.

Our first Cherokee screencast

Alvaro and I have been putting together a screencast to show an overview of Cherokee-Admin’s capabilities. It is just an introduction, but I think this kind of thing is really helpful to spread the word about Cherokee’s many merits.

We wanted to brag about our little baby. After all, not every serious web server out there has a killer interface to configure it. Take a look at our Cherokee Web Server introductory screen-cast.

You might want to watch it in full screen for readability.

It’s just one of many to come. We’ve got some more planned, so I’ll let you know when they’re ready.

 