Wednesday 14 September 2011

Search pypi through an authenticating proxy

In an effort to do more rigorous python testing, I've been looking into alternative packaging tools that facilitate building isolated test environments. Previously I'd used easy_install and distribute. A popular alternative that plays nicely with virtualenv is pip.

It turns out that pip 0.7.2 has a few annoying little bugs, one of which is that search does not correctly support xmlrpc proxies, in particular authenticating proxies (issues 207 and 243). Changing pip, and the associated version of pip bundled with virtualenv, would (I assume) be a slow process, and not something I'm in a position to test at the moment.

My quick-fix solution is a command-line tool for searching pypi, which works by replacing the transport used by xmlrpclib with one based on urllib2. I've also (informally) tested similar code in pip itself by patching download.py, and can confirm that it worked correctly on my system.

import os
import sys
import urllib2
import xmlrpclib

# install a proxy handler if required
_proxy = os.environ.get('http_proxy', None)
if _proxy:
    proxysupport = urllib2.ProxyHandler({"http": _proxy})
    opener = urllib2.build_opener(proxysupport, urllib2.CacheFTPHandler)
    urllib2.install_opener(opener)

class ProxyTransport(xmlrpclib.Transport):
    '''An xmlrpclib transport that makes its requests via urllib2,
    so the proxy handler installed above is honoured.'''
    def request(self, host, handler, request_body, verbose=0):
        self.verbose = verbose
        url = 'http://%s%s' % (host, handler)
        request = urllib2.Request(url)
        request.add_data(request_body)
        # Note: 'Host' and 'Content-Length' are added automatically
        request.add_header("User-Agent", self.user_agent)
        request.add_header("Content-Type", "text/xml") # Important
        f = urllib2.urlopen(request)
        return self.parse_response(f)

if __name__ == "__main__":
    pypiurl = 'http://pypi.python.org/pypi'
    transport = ProxyTransport()
    pypi = xmlrpclib.ServerProxy(pypiurl, transport=transport)
    packages = pypi.search({'name': sys.argv[1]})
    for pkg in packages:
        print pkg['name']
        # guard against packages with an empty summary
        summary = (pkg['summary'] or '').splitlines()
        if summary:
            print '  ', summary[0]
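
To run it through an authenticating proxy, set http_proxy in the usual way. Everything below (the script name, credentials, host and port) is a placeholder, not something from my actual setup:

http_proxy=http://user:password@proxy.example.com:8080 python pypi-search.py virtualenv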

Monday 12 September 2011

Searching LaTeX files

I'm making my final amendments to my thesis at the moment, and one of my examiners suggested a change of terminology, so I needed a convenient way to find every instance of a word across a group of LaTeX files.

Doing this with just grep is possible, but it retrieves a lot of false positives from comments; avoiding those requires more complex regular expressions, which are difficult to write and maintain. Another alternative is to pipe from one instance of grep to another.
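
For instance, piping one grep into another might look something like this (a rough sketch: it skips lines that are entirely comments, but will still match text sitting before an inline %):

grep -v '^ *%' ${FILES} | egrep -w --color ${SEARCHPATTERN}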

For what I'm doing, it seemed easier to split the task into two smaller pieces: a Perl script to remove LaTeX comments, and egrep to do the search.

#!/usr/bin/env perl
use strict;
use warnings;

for my $arg (@ARGV) {
    open(my $fh, '<', $arg) or die "cannot open $arg: $!";
    while (<$fh>) {
        chomp;
        # print everything before the first '%', prefixed with the
        # filename so matches can be traced back to their source
        # (note: escaped \% sequences are not handled)
        if (m/^([^%]*)/) {
            print "$arg:\t$1\n";
        }
    }
    close $fh;
}

Which is called like so:

latex-comment-filter ${FILES} | egrep -w --color ${SEARCHPATTERN}

Tuesday 6 September 2011

Book Review - Continuous Delivery

Just before MPUG last night, I dropped into the RMIT bookstore to have a browse. I found a copy of Continuous Delivery that I'd read about on Martin Fowler's blog (bliki?). I've probably read about two thirds of it going in and out on the train today.

I've been pretty wary of buying process-style software engineering books; they tend to be either vague or too focused (typically on technology stacks I don't care about). This one was interesting, mainly because of the difficulties I've seen in testing, managing third-party dependencies and deployment. I'd also seen some of the blog posts on the book's website and been fairly impressed with the writing style.

The table of contents looks fantastic: configuration management, continuous integration, unit/integration testing, acceptance testing, non-functional requirements, deployment pipelines. Sadly, the actual book hasn't quite lived up to my expectations. My biggest issue is that the material is presented in a fairly cumbersome way. After a chapter explaining what's coming, you get a couple of chapters that expand (a bit) on those ideas, and then you get a summary of what's been presented. Perfect if you can get your manager to read a chapter with the intention of selling them on a new process; less ideal if you're a practitioner trying to learn something new.

My second concern is that the book has a fairly simple message: that the way to achieve continuous delivery is to script everything. While process, business constraints and education are discussed, they're very much secondary, which I think is a mistake in a technology-agnostic book.

The message of the book seems very relevant for very large desktop applications and web applications, but less so for projects developed by small teams, library developers and research coders. This is disappointing, because I consider myself to fall into these latter categories.

Regardless, the book has been a bit of fun; it has been a while since I've read that much of a single tech book in one day. The anecdotes throughout the text show that the authors are both knowledgeable and experienced in using a range of tools, and I've found myself eagerly reading every sidebar.

Saturday 3 September 2011

Nine letter target

My wife's grandfather is an avid newspaper reader, with a sharp and critical mind. He loves solving puzzles. Last time we caught up, I jokingly said that I could write a program to solve the puzzle he was working on.

The puzzle was a nine letter target, common in Australian newspapers. The aim is to find all possible words of between 4 and 9 letters from a 3 by 3 grid of letters, where each word must also contain the central letter. A puzzle generator is available online.

I managed it, just ;-), with a bit of haggling over whether some of the words were acceptable.

import sys
import collections

def freq(letters):
    '''
    Construct a mapping of letter to frequency in a particular word

    { char -> int }
    '''
    d = collections.defaultdict(int)
    for c in letters:
        d[c] += 1
    return d

def validWord(word, special, targetFreq):
    '''
    Check if a word matches the nine letter word rules
    '''
    # must satisfy the length criterion; no upper bound is needed,
    # since the frequency check below caps words at nine letters
    if len(word) < 4:
        return False

    # must contain central letter
    if special not in word:
        return False
    # must satisfy letter frequencies
    f = freq(word)
    for c in word:
        if c not in targetFreq:
            return False
        if f[c] > targetFreq[c]:
            return False
    return True

def nineLetterTarget( letters, dictionaryFile ):
    matches = []
    letters = letters.lower()
    special = letters[4]
    targetFreq = freq(letters)
    with open( dictionaryFile, 'r' ) as dictfh:
        for line in dictfh:
            # remove the end-of-line marker; rstrip avoids losing the
            # final character when the file lacks a trailing newline
            line = line.rstrip('\n')
            # lower case the word
            line = line.lower()
            # only keep the word if it matches
            if validWord(line, special, targetFreq):
                matches.append(line)
    matches.sort( key = lambda word : len(word) )
    return matches

if __name__ == "__main__":
    dictFile = '/usr/share/dict/british-english'
    letters  = 'vnogdreec'
    words = nineLetterTarget( letters, dictFile )
    for word in words:
        print word
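
As an aside, the frequency test in validWord is really a multiset-containment check, and with collections.Counter (available from Python 2.7) the same rule can be written more compactly. This is just a sketch of the idea, not a drop-in replacement, and validWordCounter is a name I've made up here:

import collections

def validWordCounter(word, special, letters):
    # same rules as validWord: at least 4 letters, must use the
    # central letter, and must not use any letter more often than
    # it appears in the grid
    if len(word) < 4 or special not in word:
        return False
    # subtracting Counters keeps only positive counts, so an empty
    # result means every letter of 'word' fits within 'letters'
    return not (collections.Counter(word) - collections.Counter(letters))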

Friday 2 September 2011

Multiple-file LaTeX diff

One major pain point I faced while finishing my thesis was finding a nice way to show my supervisor exactly what revisions I'd made since our last chat. One thing that made a big difference to this process was using latexdiff. One limitation of that tool is that it doesn't support documents that span multiple files, a feature used heavily in my thesis template.

The work-around was to write a small Python script to glue the files back together, and so here is a script to flatten LaTeX files:

#!/usr/bin/python
import sys
import os
import re

# match \input{filename}; [^}] keeps the match inside the braces
inputPattern = re.compile(r'\\input\{([^}]*)\}')

def flattenLatex( rootFilename ):
    dirpath, filename = os.path.split(rootFilename)
    with open(rootFilename,'r') as fh:
        for line in fh:
            match = inputPattern.search( line )
            if match:
                newFile = match.group(1)
                # add the extension if it's missing; paths are assumed
                # to be relative to the file doing the including
                if not newFile.endswith('.tex'):
                    newFile += '.tex'
                flattenLatex( os.path.join(dirpath,newFile) )
            else:
                sys.stdout.write(line)

if __name__ == "__main__":
    flattenLatex( sys.argv[1] )

Which ends up being called like this:

# merge multiple files into the old and current versions of the document
flatten-latex ${DIFFTREE}/thesis.tex > old.tex
flatten-latex ${WORKINGTREE}/thesis.tex > cur.tex

# produce the marked up document
latexdiff old.tex cur.tex > tmp.tex

# fix line ending problem introduced by latexdiff
sed 's/^M//' tmp.tex > diff.tex
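
One small note on that last step: the ^M in the sed expression is a literal carriage return, typed at the shell as Ctrl-V then Ctrl-M. If you'd rather avoid embedding the literal character, deleting carriage returns with tr should do the same job:

tr -d '\r' < tmp.tex > diff.tex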

Thursday 1 September 2011

GCC No Return Errors

Most of the time that I'm writing C/C++ code, I expect that my compiler will just "do the right thing". I ran into an interesting problem which highlights just how naive that can be, and have started looking into strategies for preventing this problem, and ones like it, in the future.

In C++ it is permissible to omit the return statement from a function that has been declared with a non-void return type. For example, in the following code the function bar omits its return. This, however, is getting into the realms of undefined behavior.

struct Foo { 
    int a;
    int b;
};

Foo bar() { 
    // missing return
}

int main() {
    Foo f = bar();
    return 0;
}

By default, compiling that snippet of code using gcc:

g++ -c snippet.cpp

gives no warnings at all. Most of the time, I'm careful enough to at least compile with warnings enabled (-Wall):

> g++ -Wall -c snippet.cpp
snippet.cpp: In function ‘Foo bar()’:
snippet.cpp:8: warning: no return statement in function returning non-void

which is at least a helpful warning message. However, at times it can be fairly easy to ignore warnings, and this is where I got bitten. The no return statement warning was hidden amongst warnings being generated by third party library code, and I (wrongly!) assumed that it wasn't important.

I'd prefer that this didn't happen again in the future. GCC has a -Werror flag that will convert all warnings to errors, but in my case, I'd prefer fast development cycles to having perfectly clean and portable code (at least for prototyping).

Turning to stackoverflow, I found a method for converting the warning to an error, and in the process learned how to use gcc warning names and diagnostics. The final solution is to use -Werror=, in this case -Werror=return-type:

g++ -Werror=return-type -Wall -c snippet.cpp
snippet.cpp: In function ‘Foo bar()’:
snippet.cpp:8: error: no return statement in function returning non-void

The side-effect of doing this highlighted how important it will be in the future to consistently (but selectively) force myself to deal more rigorously with warnings. Again, stackoverflow came to the rescue in identifying the flags that are most likely to help.
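
As a concrete starting point, promoting a few of the more dangerous warnings to errors might look like this (the particular warnings here are my own picks, not a recommendation from that discussion):

g++ -Wall -Wextra -Werror=return-type -Werror=uninitialized -c snippet.cpp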

It's probably worth noting that under MSVC, this warning is automatically promoted to an error by default.