[Metafacture] SUM algorithm & grouping questions

Tue Jun 17 23:03:34 CEST 2014

Hi Nico!

Am Di 17 Jun 2014 13:03:33 CEST schrieb Nico Siebert:
> I have some questions about the metafacture framework.
>
> Are there any more information regarding the SUM algorithm apart from
> the flowchart?

I am afraid but apart from some rather high-level slides there is no
more detailed information on the algorithm. Luckily, the basic approach
is conceptually quite simple:

We assume that each record has a unique record-id which we can use to
refer to a record.

1. For each record keys are generated. This is the part described in the
flowchart. It is the only part of the algorithm that depends on
domain-specific logic.

2. A key-value list is created which lists all keys and for each key the
record-id of the record it was generated from

3. This list is then sorted by keys so that record-ids with the same key
are grouped together

4. These groups of record-ids with the same key are than turned into
"bundles"

5. If more than one key was generated for each record then there can be
records which appear in two bundles. For instance, we could have the
following keys:

  key-1 : record-A
  key-1 : record B
  key 2 : record-A
  key 2 : record-C
  key-3 : record B
  key-4 : record C

This would produce two bundles with more than one member: {record-A,
record-B} and {record-A, record-C}. As the members of a bundle are
considered equal, we assume from record-A = record-B and record-A =
record-C follows record-B = record-C (transitivity). So, the two bundles
should really be one. For this reason, the last step of the algorithm
computes the transitive hull for all bundles and merges bundles with
transitive links.

>From a mathematical perspective (I am not a mathematician so that might
not be a 100% correct!) the algorithm for computing the keys defines an
equivalence relation on the records (one for each type of key it
generates to be precise). The remainder of the algorithm then just finds
the records within one equivalence class and bundles them. Step five
finally joins the different equivalence relations.

> I understand that for each record 3 matchkeys are created and those
> will be compared to the corresponding matchkeys of another record.
>
> Now I wonder:
>
> 1. Is *every* key of a record compared to the corresponding key of a
> second record? If so, will any match result in a group or must at
> least 2 MatchKeys be equal? What is the logic behind this?
>
> Example:
> AMatchKey1        compareTo    BMatchKey1    RESULT=false
> AMatchKey2        compareTo    BMatchKey2    RESULT=false
> AMatchKey3        compareTo    BMatchKey3    RESULT=true
>
> --> Results in grouping of Record A and B ?

Yes, record A and B are grouped because one of their keys matched.

> 2. Is there any weighting implemented or are the matchkeys simply
> compared like mk1.match(mk2)?

No, there is no weighting. The keys are simply matched.

Hope things are a bit clearer now.

Best,
Christoph