[Metamorph] Re: [culturegraph] Metamorph: Grouping entities based on entity suffixes

Pascal Christoph christoph at hbz-nrw.de
Wed Apr 17 10:08:37 CEST 2013


Hi Christoph,

Although I am not sure I have fully understood your situation, I think we
have encountered a similar problem. Basically, I would call it "records in
records" (or, thinking in graphs: dealing with multiple nodes contained in one
record).

While transforming the ZDB-ISIL authority file into RDF for
lobid-organisation[1] we have to deal with blank nodes (in essence, a new node,
or if you like, a new record). A workaround enables Metamorph to deal with
these (at least to a depth of one; records in records in records (...) are
not possible). Have a look at [2] for the idea. The value 'bnode' of the
attribute 'name' in the morph-xml data tag (...<data source="..."
name="bnode">...) makes PipeEncodeTriples treat the value of the attribute
'format' of the regexp tag as a triple, thus creating a "record" (a (blank) node).
This node is linked to the root node (the default (or "top") record) if the
value of the morph attribute 'format' begins with '_:', as in 'format="_:a"'.

We are aware that this is only a hack, since it puts too many
assumptions/semantics into values. A new (optional) keyword like
'<data ... type="beginSubrecord">...</data>' would be better. It would be
passed on to the encoders, as in "public void literal(final String name,
final String value, final String type)" (where 'type' would be null by
default and could be ignored).
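
To make the idea a bit more concrete, here is a minimal sketch in Java of what such a typed 'literal' call could look like. It is not the existing Metamorph API: the names TypedReceiver and LoggingReceiver are made up for illustration, and the two-argument method simply delegates with a null type so that existing callers would not have to change.

```java
import java.util.ArrayList;
import java.util.List;

class TypedLiteralSketch {

    /** Hypothetical receiver interface; not the actual Metamorph API. */
    interface TypedReceiver {
        // Proposed three-argument variant; 'type' may be null.
        void literal(String name, String value, String type);

        // Old-style two-argument calls delegate with a null type.
        default void literal(String name, String value) {
            literal(name, value, null);
        }
    }

    /** Records every call as text so the effect can be inspected. */
    static class LoggingReceiver implements TypedReceiver {
        final List<String> log = new ArrayList<>();

        @Override
        public void literal(String name, String value, String type) {
            log.add(name + "=" + value + (type == null ? "" : " [" + type + "]"));
        }
    }

    static List<String> demo() {
        LoggingReceiver receiver = new LoggingReceiver();
        receiver.literal("title", "Some journal");              // old style, type is null
        receiver.literal("idFragment", "12", "beginSubrecord"); // proposed style
        return receiver.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

An encoder that does not care about subrecords could ignore the third argument entirely.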

We would then be able to use Metamorph to declare, e.g., the beginning of new
nodes, like:
'<data source="101@" name="idFragment" type="beginSubrecord">'
and/or
'<data source="201B/01" name="idFragment" type="beginSubrecord">'
'<data source="201B/01" name="yourElement" type="subrecord">'
'<data source="201B/02" name="idFragment" type="beginSubrecord">'
'<data source="206W/02" name="idFragment" type="subrecord">'

...
(where 'idFragment' is something you need to build a unique identifier and
'yourElement' is treated as it is today).
To close a subrecord it would be enough to encounter a new "beginSubrecord" (as
we assume that a depth of one for records-in-records is all that we will ever
need ;) ) or a 'null' type (as this would be the default 'root node').
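
The open/close rule just described could be sketched on the receiving side roughly as follows. Again, this is only an illustration under assumptions: SubrecordGrouper is a made-up class, the type values "beginSubrecord" and "subrecord" are the ones proposed here, and the literal values in the demo are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical receiver for the proposed typed literals; not existing Metamorph code.
class SubrecordGrouper {

    private final List<String> root = new ArrayList<>();
    private final List<List<String>> subrecords = new ArrayList<>();
    private List<String> current = root;

    void literal(String name, String value, String type) {
        if ("beginSubrecord".equals(type)) {
            // Opening a subrecord implicitly closes the previous one.
            current = new ArrayList<>();
            subrecords.add(current);
        } else if (type == null) {
            // A null type means the literal belongs to the root node.
            current = root;
        }
        // Type "subrecord" keeps whatever target is currently open.
        current.add(name + "=" + value);
    }

    List<String> rootLiterals() { return root; }

    List<List<String>> subrecords() { return subrecords; }

    static SubrecordGrouper demo() {
        SubrecordGrouper g = new SubrecordGrouper();
        g.literal("title", "t", null);
        g.literal("idFragment", "01", "beginSubrecord");
        g.literal("yourElement", "x", "subrecord");
        g.literal("idFragment", "02", "beginSubrecord");
        g.literal("yourElement", "y", "subrecord");
        g.literal("note", "n", null);
        return g;
    }

    public static void main(String[] args) {
        SubrecordGrouper g = demo();
        System.out.println("root: " + g.rootLiterals());
        System.out.println("subrecords: " + g.subrecords());
    }
}
```

Since only one level of nesting is assumed, a single "current" pointer is enough and no stack is needed.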

This enhancement should be backward compatible.
These are just ideas and they need rethinking, no doubt.

However, the solution you provided with the PicaItemSplitter is, in my eyes,
not very satisfactory: it is not generic, and thus we cannot reuse it to
deal with our "blank nodes" problem.

We would be glad if you wanted to go in that direction. What do you think? We
could also contribute code if you like.

best wishes,
pascal

[1] http://lobid.org/organisation
[2] https://github.com/lobid/lodmill/commit/ae5be2e282c32ae40ea0167eafe65a8477bbbffa

On 11.04.2013 at 13:01, Böhme, Christoph wrote:

> Hi all,
> 
> I am trying to use Metamorph to transform title records from the ZDB dataset so that the item information can be handled more easily. However, I am not sure if I can achieve this transformation with Metamorph.
> 
> In the ZDB dataset each record contains title information and item information for each instance of the title in a library. The item information is described by a sequence of fields. These fields are grouped by a common suffix per item. The suffix is not unique within a record, though; fields describing items belonging to two different libraries may use the same suffix. However, these sequences are (apparently) separated by (an undocumented) field. So, in a nutshell the input records look like this:
> 
> record {
> 	/* ... fields describing the title ... */
> 
> 	101@ { a: '12' },
> 	201B/01 { /* literals */ },
> 	203@/01 { /* literals */ },
> 	101@ { a: '34' },
> 	201B/01 { /* literals */ },
> 	206W/01 { /* literals */  },
> 	201B/02 { /* literals */  },
> 	206W/02 { /* literals */  },
> 	203@/02 { /* literals */  }
> }
> 
> Obviously, this format is difficult to work with in Metamorph. To make further processing easier, I would like to group all fields describing one item within an entity. So, the output of my transformation should look like this:
> 
> record {
> 	/* ... literals and entities describing the title ... */
> 
> 	101@ { a: '12' },
> 	item {
> 		201B { ... },
> 		203@ { ... }
> 	},
> 	101@ { a: '34' },
> 	item {
> 		201B { ... },
> 		206W { ... },
> 	},
> 	item {
> 		201B { ... },
> 		206W { ... },
> 		203@ { ... }
> 	}
> }
> 
> So, what needs to be done is basically: open a new "item" entity (and close the previous one) every time the text after the slash changes or if an entity without a slash is encountered (the "101@" entities in the example). However, I am not sure if this can be achieved with Metamorph. Any comments are appreciated!
> 
> Cheers,
> Christoph
> 
> 
> 
> _______________________________________________
> culturegraph mailing list
> culturegraph at lists.d-nb.de
> http://lists.d-nb.de/mailman/listinfo/culturegraph
> 
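
For reference, the grouping rule from the quoted message (open a new "item" whenever the text after the slash changes, and close the open one when a field without a slash appears) could be sketched in plain Java as below. This is independent of Metamorph and only works on field names; ItemGrouper and its output markers "item {" and "}" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; illustrates the suffix-based grouping rule only.
class ItemGrouper {

    /** Returns the field names with "item {" and "}" markers inserted. */
    static List<String> group(List<String> fieldNames) {
        List<String> out = new ArrayList<>();
        String openSuffix = null; // suffix of the currently open item, or null
        for (String field : fieldNames) {
            int slash = field.indexOf('/');
            String suffix = slash < 0 ? null : field.substring(slash + 1);
            if (openSuffix != null && !openSuffix.equals(suffix)) {
                // Suffix changed, or the field has no slash: close the item.
                out.add("}");
                openSuffix = null;
            }
            if (suffix != null && openSuffix == null) {
                out.add("item {");
                openSuffix = suffix;
            }
            // Emit the field name without its suffix.
            out.add(slash < 0 ? field : field.substring(0, slash));
        }
        if (openSuffix != null) {
            out.add("}"); // close an item still open at the end of the record
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "101@", "201B/01", "203@/01",
                "101@", "201B/01", "206W/01",
                "201B/02", "206W/02", "203@/02");
        group(input).forEach(System.out::println);
    }
}
```

Note that the second "201B/01" opens a new item even though the suffix repeats, because the "101@" field in between has already closed the previous one.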



