A Django site.
July 10, 2007
» Phase3 Complete

Now that Phase3 is ready for review, we have added weight scaling functionality to patient matching module. In fact, I have both versions of weight scaling implemented (Framework A and B). This week, I will be performing performance tests this week to determine if there is noteworthy performance difference between these two approaches.

3. Runtime Component, operational:  Modify the ScorePair method to incorporate frequency scaling.  This process should be performed incrementally, in two phases.The first phase will hard code the frequency scaling equation, into the existing ScorePair method.  Once the entire linkage process (from analytics to operational phase) has been tested and successfully implements frequency scaling as a prototype, we will proceed to phase 2.In the second phase ScorePair will be re-factored to accommodate a framework that accepts future modifications to linkage scores established by the Felligi-Sunter model.  These modifications include the frequency scaling, and will also include modifying the agreement weight based on the degree of string similarity as established by various string comparators.

July 2, 2007
» Phase2 Complete

I refactored the code for Phase1 according to the feedback I received, and completed Phase 2 (Here is the changeset). I will make a more detailed post tomorrow (before the phone call). The next step (changing the formula for weight scaling) should be fairly easy, after I figure out where scores are calculated in the existing code. Testing and commenting the code should not take too long either, since I know that the code is functional, just have to check for boundary cases etc.

Runtime Component, start-up:  Implement functionality instantiating a data structure that provides fast lookup of individual token frequencies.  This data structure will likely be a hash table, where the key is the token value (eg, last name of “SMITH”) and the value is the token frequency (eg, 2,102). This data will be loaded from the persistent data structure created in task 1(e).Because the primary performance constraint for weight-based frequency scaling will be the lookup, we will need to be able to configure the number of elements loaded into the hash table.  For example, it is likely that some fields will have hundreds of thousands of unique tokens (eg, name fields), while others will have on the order of 10 or 20 (middle initial, month of birth).Also, weight scaling can be used to either increase or decrease individual field weights.  If an individual token frequency is less than the average frequency it will be increased, if it is above the average frequency it will be decreased.  Consequently, there needs to be some ability to configure the total number of tokens loaded into the lookup structure for each field.a. Implement functionality to load top ‘N’ most/least frequent tokens from the persistent data structure, where top, bottom, and ‘N’ are specified in the configuration file.  Ifb. Other (future) options may include top or bottom N%, frequencies above or below N.He

June 25, 2007
» Phase1 Complete

I completed Phase 1, the Analytic Component. It includes:

Implement and analytic object that performs the following tasks:a. Read linkage configuration file and determine which fields/columns need to be scaled.b. Connect to data source containing individual tokens. Assume that the data source is either a flat file such as a CSV file or a relational database.c. Add new information to configuration file that indicates location of token frequenciesd. Count frequencies of individual tokens.e. Store token frequency results in persistent structure (eg, a relational database table).  In order to access the token frequency data at runtime, the frequency tables need to be identified in the configuration file.  Thus, will need to develop a programmatic scheme to identify each token frequency table associated with a given data source, eg:

What has changed in the source code?Testing code has been moved into a new package called org.openmrs.testingorg.regenstrief.linkage.analysis,A new abstract class for analyzers called DataSourceAnalyzerTwo classes that extend it are CharDelimFileAnalyzer and DataBaseAnalyzerThese classes contain a CharDelimFileReader/DataBaseReader to go over/query recordsIn this schema, existing classes such as DataSourceAnalysis, Analyzer  and ScaleWeightAnalyzer are not used. The schema provided by existing classes was more general, and if we can find a way to fit my classes into it, I’m willing to refactor the code. Otherwise, I’ll delete them.org.regenstrief.linkage.dbAnalyzers contain a LinkDBManager to insert token frequencies into the database. I’m not sure if this was the most suitable class for adding this code.org.regenstrief.linkage.ioDataSourceReader has a new parameter to determine if it will be used for analysis or reading. This change was necessary because in reading, some columns are excluded and blocking is done to make the source ready for linkage. However, we don’ want these in analyzing the data.org.regenstrief.linkage.utilLinkDataSource: Added a variable to store a unique identifier for each linkdatasource (is used in storing token frequencies)RecMatchConfig: Added a LinkDBManager to create a connectionXMLTranslator: Modified to include id of the linkdatasource and a new parameter for storing token frequencies 

June 16, 2007
» Database schema for analysis results

Here is a draft on how to store analytical phase results in a relational database table. I created this diagram using DBDesigner 4, the same tool used to create OpenMRS Data Model. Right click here and choose ”Save target as” if you’d like to load this schema in DBDesigner and modify it.model.pngDatasource_analysis table mimics LinkDataSource class in org.regenstrief.linkage.io package. I am imagining a GUI in OpenMRS where the user manually chooses among existing data sources, or adds a new data source in which the datasource_id is automatically assigned by the database.Field table contains changed and data_changed attributes to determine how fresh the statistics are.

May 29, 2007
» Agenda for the week

After receiving a great guideline on how to implement weight scaling from James, I examined part of the existing code today. In the analytic phase, for each field, we need to calculate:(1) number of unique values(2) frequency of each unique value(3) total recordsThese calculated values should be stored in the database for future analysis.My idea for the data structure to hold these values, ScaleWeightData, is:- integer array indexed by column id for each field regarding (1) and (3)- hash table indexed by token that contains most frequent K records for (2), with database lookups for tokens that are not found in the hash table (for large data sets)A few questions as usual:1) Scalability: What magnitude of scalability are we trying to achieve? Will scalability be a major design concern right from the beginning or are we trying to make it work first, and later worry about scalability? Should I worry about a real system with millions of patient records that may not fit into memory?2) Null values: How do they affect our statistical analysis? Do we ignore them when calculating total number of records?

May 20, 2007
» Connecting the dots

After reading the Fellegi-Sunter paper, I’d like to map the ideas described in words to formulas used in the original paper:

(1) FS generates a likelihood score based on agreement pattern among corresponding fields from 2 records. The higher the likelihood score, the more likely two records represent a match, rather than simple random agreement among fields.


Since we expect more fields to agree on matches, the ratio will have high values for matches, and low values for unmatched records.

(2) Digging deeper, for a given record pair, the score is calculated by multiplying the *agreement* likelihood weight for each  set of corresponding fields that agree (eg, both last names are ‘JONES’) and multiplying by the *disagreement* weight for corresponding fields that disagree (eg, dates of birth disagree).

Score =

(3) Conditional independence is assumed when we multiply each field’s full weight.

(4) There are a few benefits to assuming conditional independence. First, we can make some reasonable estimates regarding the true-positive and true negative rates for various likelihood scores. These measures help us set a score threshold for true- and false-links.

I’d like to get the terminology right with true-links and false-links: 
Let’s say we pair all records together. If two records in the pair belong to the same person in reality, we call it a true-link. Those records that are paired together, but correspond to different people in reality, are called false-links. In the FS paper, these correspond to:
`A

True-links:
`M

False-links:
`U

(5) Second, assuming conditional independence allows us to associate monotonically increasing scores with increased true positive rates. Simply stated, higher likelihood scores can be considered to be more likely true matches.

`w^{k}(gamma_{k})=log

`w(gamma)=w^{1}+w^{2}+cdotcdotcdot+w^{k}`

So what we would like to do is to enhance the scoring function, so that it takes into account the frequency of the values being compared:

Some questions that came up to my mind:

A) Please comment on the mapping between ideas and formulas, and point out any misunderstandings I have.

B) In the formula, we assume each identifier to have equal weight. However, in reality, some identifiers can provide us more information than others (Shannon’s entropy). For instance, if we have records from a neighbourhood, their zip codes will be mostly same and it won’t be of much use. This suggests a proposed scaling like this:

However, isn’t this kind of scaling already inherit in the formula? In the extreme case where all the values of a field are the same, m/u will be 1 and the logarithm of it will be 0, meaning that it won’t have an effect on the score. (assuming that EM provides good estimates)

C) String comparison functions do not have to produce binary results, right? They don’t have to say either these two field match or do not match. They can give any value between 0…1

» Background check

In Overview of Record Linkage and Current Research Directions, Winkler states that basic ideas in Fellegi-Sunter model are based on statistical concepts such as odds ratios, hypothesis testing, and relative frequency.

Luckily I had a lot of exposure to hypothesis testing and relative frequency in the statistical modeling course this semester. Apparently the rest is basic probability:

Odds ratio:  Tells how much more likely it is that the event A occurs than it is that it does not occur.
`frac{P(A)}{P(A^{C}