For analyzing token frequencies, I have implemented two different designs.
Design A: For different data sources, create specific data readers that convert your data into Records. For frequency analysis, operate on the abstraction of Record.
Advantage:
- Easy to maintain, implement once and for all
- Reduces complexity of the project
Disadvantage:
- Poor performance
- High memory requirement: since the analysis runs record by record instead of column by column, you need to keep frequency data for ALL columns in memory at the same time
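
The record-by-record approach of Design A can be sketched as follows. This is only a minimal illustration, not the project's actual code: the class name `RecordFrequencyAnalyzer` is hypothetical, and a plain `Map<String, String>` (column name to value) stands in for the real Record abstraction.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Design A: one analyzer that only sees records.
class RecordFrequencyAnalyzer {
    // One frequency table per column, all kept in memory at the same time.
    private final Map<String, Map<String, Integer>> frequencies = new HashMap<>();

    // Count every column value of a single record (assumed shape: column name -> value).
    void analyze(Map<String, String> record) {
        for (Map.Entry<String, String> field : record.entrySet()) {
            frequencies
                .computeIfAbsent(field.getKey(), column -> new HashMap<>())
                .merge(field.getValue(), 1, Integer::sum);
        }
    }

    // Number of times the token has been seen in the given column so far.
    int frequencyOf(String column, String token) {
        Map<String, Integer> columnTable =
            frequencies.getOrDefault(column, Collections.<String, Integer>emptyMap());
        return columnTable.getOrDefault(token, 0);
    }
}
```

Because every record touches every column, the analyzer cannot discard any column's table until the last record has been read, which is where the memory cost comes from.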
Design B: For each type of data source (database, character-delimited file, etc.), we implement a specific analyzer in addition to a data reader.
Advantage:
- Better performance for some data sources, since you are operating on a lower level of abstraction
- Possibility of doing analysis column by column, the way it is supposed to be
Disadvantage:
- Each data source needs a different analyzer, which increases complexity
- Could be tiresome to implement
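
For comparison, a source-specific analyzer in the spirit of Design B might look like the sketch below for a character-delimited file. The class and method names are hypothetical, and the sketch assumes one record per line with no quoted or escaped delimiters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of Design B for character-delimited files.
class DelimitedColumnAnalyzer {
    // Counts the tokens of a single column; only that column's table is kept in memory.
    static Map<String, Integer> analyzeColumn(Reader source, char delimiter, int columnIndex) {
        Map<String, Integer> frequencies = new HashMap<>();
        // Quote the delimiter so characters such as '|' are not treated as regex syntax.
        String splitter = Pattern.quote(String.valueOf(delimiter));
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Limit -1 keeps trailing empty fields, so column positions stay stable.
                String[] fields = line.split(splitter, -1);
                if (columnIndex < fields.length) {
                    frequencies.merge(fields[columnIndex], 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return frequencies;
    }
}
```

Analyzing another column means reading the file again, so this trades extra passes over the data for a smaller memory footprint.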
The performance difference becomes apparent when analyzing databases. Instead of loading our data into memory, processing it and writing it back, we can just ask MySQL to tell us token frequencies with a query, and store this information for later use. Below is the result for a data source of 5000 records (fileA_5000) consisting of 19 columns. Token frequency analysis of one column takes:
Design B: 1843 milliseconds
Design A: 4035 milliseconds
Design A does not improve much even if we reduce the column count to one. This could mean that execution time consists mostly of overhead from iterating over all records and the numerous function calls that brings.
Last word:
Even though there is a significant performance difference for database data, I wouldn't argue passionately for design B. This decision depends on how many different data sources we are planning to support, their nature (whether a performance gain is possible or not) and whether execution time stays within reasonable limits in design A or not.
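
The database shortcut mentioned above, asking MySQL for token frequencies directly, comes down to a single GROUP BY query per column. The helper below is a minimal sketch of how such a query could be built. The class `FrequencyQuery`, the result aliases `token` and `frequency`, and the column name used in the example are assumptions, not names from the actual implementation.

```java
// Hypothetical helper that builds the per-column frequency query for MySQL.
class FrequencyQuery {
    // Returns a query yielding one row per distinct token with its count.
    static String forColumn(String table, String column) {
        return "SELECT " + quote(column) + " AS token, COUNT(*) AS frequency"
            + " FROM " + quote(table)
            + " GROUP BY " + quote(column);
    }

    // MySQL quotes identifiers with backticks; an embedded backtick is doubled.
    private static String quote(String identifier) {
        return "`" + identifier.replace("`", "``") + "`";
    }
}
```

For example, `FrequencyQuery.forColumn("fileA_5000", "first_name")` yields a statement that could be run over JDBC and whose rows could be stored for later use, so the records themselves never need to be loaded into the application.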






Datasource_analysis table mimics the LinkDataSource class in the org.regenstrief.linkage.io package. I am imagining a GUI in OpenMRS where the user manually chooses among existing data sources, or adds a new data source, in which case the datasource_id is automatically assigned by the database.
Field table contains changed and data_changed attributes to determine how fresh the statistics are.