A Django site.
July 10, 2007
» Code Spotlight: Analyzing token frequencies

I’d like to read a text file and store the frequency of each word in it into a database. Memory is fast, database is slow. At the two extremes, we have:1) I don’t use any memory, for each word I read, I query the database to learn it’s frequency, and update it in the database.Problem: Very Inefficient2) Everything is stored in the memory, I keep a hashtable in which I store frequencies of words. After reading the whole file, I write all entries to the database.Problem: Very fast, but I may have so many words that they won’t fit into memory.Solution:I keep part of the words in a hashtable, for those entries that are not in the hashtable, I query the database.SubProblem: How do I decide if I’ll put a word in the hashtable or not? I’d like to store the most frequent words in the hashtable so that I will minimize the number of queries to the database. But I don’t know frequencies of words in advance, they change as I read the file. My hashtable has to organize itself as I read the text file.Solution: If the next word I read is not in the hashtable, and is more frequent than the *least* frequent word in the hashtable, I put the new word into the table and remove the least frequent one. The reasoning is that if a word has high frequency, it is likely that I will see more of the same word.SubProblem: I need a way to efficiently determine the least frequent word in the hash table. Further more, after I replace the least frequent one, which word will be the least frequent one now? What about next round?Solution: Along with the hashtable, I also keep a data structure called PriorityQueue. If implemented using a heap, it provides insert, remove and findMin operations in O(logN) time.Here are some experimental results for analyzing one column of 10,000 records:

  • Without a lookup table, only using database: 16 seconds
  • With a lookup table of 250 records (covers 60% of data): 8 seconds
  • With a lookup table of 500 records (covers 80% of data): 5 seconds

Any ideas or comments on this approach will be appreciated :)

June 7, 2007
» Code reviews

Google was cool enough to send out Karl Fogel’s Producing Open Source Software book as a suprise gift to all Summer of Code ‘07 participants. I have already read that book at our local library after getting accepted, therefore I was hoping to receive another book, Open Sources 2.0: The Continuing Evolution :) Nevertheless, the book we received is more suitable in terms of the goals of this program, so there is nothing to complain :)Karl Fogel was one of the developers of Subversion, and a past Google employee. In his book, he shares his experience about different aspects of open-source software development, from setting up the technical infrastucture to licensing and copyright issues. This book is worth reading because it contains specific examples and good advice.Fogel mentions a developer named Greg Stein, who decided to set an example by reviewing every line of every single commit that went into the code repository and eventually started a tradition of code reviews among developers of Subversion project.As a young developer, code reviews would be a great way to get useful feedback and improve my coding abilities. Fogel argues that one could contribute as much to the project by reviewing others’ changes as by writing new code because bugs and non-optimal coding practices that would go unnoticed would be caught on the fly.I agree with him, and would love to receive some reviews after my first few commits :)