I’d like to read a text file and store the frequency of each word in it into a database. Memory is fast, database is slow. At the two extremes, we have:1) I don’t use any memory, for each word I read, I query the database to learn it’s frequency, and update it in the database.Problem: Very Inefficient2) Everything is stored in the memory, I keep a hashtable in which I store frequencies of words. After reading the whole file, I write all entries to the database.Problem: Very fast, but I may have so many words that they won’t fit into memory.Solution:I keep part of the words in a hashtable, for those entries that are not in the hashtable, I query the database.SubProblem: How do I decide if I’ll put a word in the hashtable or not? I’d like to store the most frequent words in the hashtable so that I will minimize the number of queries to the database. But I don’t know frequencies of words in advance, they change as I read the file. My hashtable has to organize itself as I read the text file.Solution: If the next word I read is not in the hashtable, and is more frequent than the *least* frequent word in the hashtable, I put the new word into the table and remove the least frequent one. The reasoning is that if a word has high frequency, it is likely that I will see more of the same word.SubProblem: I need a way to efficiently determine the least frequent word in the hash table. Further more, after I replace the least frequent one, which word will be the least frequent one now? What about next round?Solution: Along with the hashtable, I also keep a data structure called PriorityQueue. If implemented using a heap, it provides insert, remove and findMin operations in O(logN) time.Here are some experimental results for analyzing one column of 10,000 records:
- Without a lookup table, only using database: 16 seconds
- With a lookup table of 250 records (covers 60% of data): 8 seconds
- With a lookup table of 500 records (covers 80% of data): 5 seconds
Any ideas or comments on this approach will be appreciated ![]()






