[Metamorph] Re: AW: [Spam-Wahrscheinlichkeit=99]Re: [culturegraph] Metamorph: Grouping entites based on entity suffixes

Pascal Christoph christoph at hbz-nrw.de
Thu Apr 25 10:45:28 CEST 2013


Hi Christoph,

Thanks for the explanation, it sounds promising. I will have a look at it
when I get the chance.

cheers, pascal

On 17.04.2013 15:54, Böhme, Christoph wrote:

> Hi Pascal
> 
> I think, our problems are similar in so far as we both want to split a record into separate records containing only parts of the data in the original record. On a closer look our problems are quite different, though. While you seem to have no problems specifying the information that should be part of a subrecord but cannot create such records in Metamorph, I did not even get to this point yet but already failed when trying to specify which information should go into a particular record.
> 
> For your problem there may be a simple solution, though. In Metamorph you can use <entity> to create groups of literals (and of entities, as there is no limit on nesting entities). In your PipeEncodeTriples module you can then turn each entity into a blank node.
> 
> In our linked data service we use this approach to group, in entities, the information that should go into blank nodes. In two subsequent steps we then turn the output of the Metamorph module into RDF-XML: First, the RdfMacroPipe module [1] is used to expand some abbreviations used in the Metamorph script. Second, the expanded data is serialised as XML using the SimpleXmlWriter module [2]. An example of such a transformation is the Marc21EDM transformation example in the metafacture package. If you are interested in the Metamorph scripts used to create our RDF-XML output in the linked data service, please let me know.
> 
> In principle it would be possible to output other RDF serialisations than XML using the method described above. However, our current Metamorph scripts are geared towards writing RDF-XML with the SimpleXmlWriter, which makes it difficult to change the serialisation format by simply swapping the SimpleXmlWriter with something like a SimpleTurtleWriter module. In order to do this, the Metamorph scripts would need to output a more format-agnostic representation of the RDF data. An additional problem when serialising RDF is that data can be serialised in different ways (this is particularly the case for RDF-XML). In the linked data service this specification of the serialisation style and the actual conversion from PICA to RDF are currently not separated. If the output of the Metamorph scripts becomes more format-agnostic, however, this specification of the serialisation style would need to be done separately. A possible solution could be a sequence of two Metamorph scripts: the first performs a semantic conversion to RDF and the second one shapes the RDF output for a specific output format which is understood by a SimpleXmlWriter, SimpleTurtleWriter or SimpleSomethingElseWriter.
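
The separation described above (semantic conversion first, serialisation style second) could be sketched roughly as follows. This is plain Java, not Metafacture code: `Triple`, `TripleSerialiser`, `NTriplesSerialiser`, `TurtleSerialiser` and the example.org URIs are all hypothetical names chosen for illustration, and the terms are assumed to arrive already written in N-Triples term syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Demo entry point: one format-agnostic triple list, two serialisation styles. */
final class SerialiserDemo {
    public static void main(String[] args) {
        List<Triple> triples = Arrays.asList(
            new Triple("<http://example.org/a>", "<http://example.org/name>", "\"Foo\""),
            new Triple("<http://example.org/a>", "<http://example.org/item>", "_:b1"));
        System.out.print(new NTriplesSerialiser().serialise(triples));
        System.out.print(new TurtleSerialiser().serialise(triples));
    }
}

/** Format-agnostic statement (assumption: terms are already valid N-Triples terms). */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

/** Second stage: decides only how the triples are written, not what they mean. */
interface TripleSerialiser {
    String serialise(List<Triple> triples);
}

/** Writes one statement per line. */
final class NTriplesSerialiser implements TripleSerialiser {
    public String serialise(List<Triple> triples) {
        StringBuilder sb = new StringBuilder();
        for (Triple t : triples) {
            sb.append(t.subject).append(' ').append(t.predicate).append(' ')
              .append(t.object).append(" .\n");
        }
        return sb.toString();
    }
}

/** Groups statements by subject and joins their predicate/object pairs with ';'. */
final class TurtleSerialiser implements TripleSerialiser {
    public String serialise(List<Triple> triples) {
        Map<String, List<Triple>> bySubject = new LinkedHashMap<>();
        for (Triple t : triples) {
            bySubject.computeIfAbsent(t.subject, k -> new ArrayList<>()).add(t);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<Triple>> e : bySubject.entrySet()) {
            sb.append(e.getKey());
            List<Triple> group = e.getValue();
            for (int i = 0; i < group.size(); i++) {
                Triple t = group.get(i);
                sb.append(i == 0 ? " " : " ;\n    ")
                  .append(t.predicate).append(' ').append(t.object);
            }
            sb.append(" .\n");
        }
        return sb.toString();
    }
}
```

The point of the sketch is only that swapping the writer does not touch the triples themselves; prefix handling, escaping and RDF-XML are left out.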
> 
> Cheers
> Christoph
> 
> PS: The PicaItemSplitter is not intended to be a flexible solution for handling subrecords. It is only a quick fix for extracting item information from PICA records :-)
> 
> 
> [1] RdfMacroPipe: https://github.com/culturegraph/metafacture-core/blob/master/src/main/java/org/culturegraph/mf/stream/pipe/RdfMacroPipe.java
> [2] SimpleXmlWriter https://github.com/culturegraph/metafacture-core/blob/master/src/main/java/org/culturegraph/mf/stream/sink/SimpleXmlWriter.java
> 
> 
> 
>> -----Original Message-----
>> From: Pascal Christoph [mailto:christoph at hbz-nrw.de]
>> Sent: Wednesday, 17 April 2013 10:09
>> To: Böhme, Christoph
>> Cc: metamorph at lists.d-nb.de; culturegraph at lists.d-nb.de
>> Subject: [Spam-Wahrscheinlichkeit=99]Re: [culturegraph] Metamorph: Grouping
>> entites based on entity suffixes
>> 
>> Hi Christoph,
>> 
>> although I am unsure if I have comprehended your situation entirely, I think we
>> have encountered a similar problem. Basically I would call it "records in
>> records" (or, thinking in graphs: dealing with multiple nodes contained in one
>> record).
>> 
>> While transforming the ZDB-ISIL authority file into RDF for
>> lobid-organisation[1] we have to deal with Blank Nodes (in essence, a new node,
>> or if you like, a new record). A workaround enables Metamorph to deal with
>> these (at least to a depth of one; records in records in records (...) are
>> not possible). Have a look at [2] for the idea. The value 'bnode' of the
>> attribute 'name' in the morph-xml data tag (...<data source="..."
>> name="bnode">...) makes PipeEncodeTriples treat the value of the attribute
>> 'format' of the regexp tag as a triple, thus creating a "record" (a (blank) node).
>> This node is linked to the root node (the default (or "top") record) if the
>> value of the morph attribute 'format="$value"' begins with '_:', as in 'format="_:a"'.
>> 
>> We are aware that this is only a hack since it puts too many
>> assumptions/semantics into values. A new (optional) keyword like '<data ...
>> type="beginSubrecord">' would be better. It would be passed to the encoders
>> as in "public void literal(final String name, final String value, final
>> String type)" (where 'type' would be null by default and could be ignored).
>> 
>> We would then be able to use Metamorph to declare e.g. the beginning of
>> new nodes, like:
>> '<data source="101@" name="idFragment" type="beginSubrecord">'
>> and/or
>> '<data source="201B/01" name="idFragment" type="beginSubrecord">'
>> '<data source="201B/01" name="yourElement" type="subrecord">'
>> '<data source="201B/02" name="idFragment" type="beginSubrecord">'
>> '<data source="206W/02" name="idFragment" type="subrecord">'
>> 
>> ...
>> (where 'idFragment' is something you need to build a unique identifier and
>> 'yourElement' is treated as it is today).
>> To close a subrecord it would be enough to encounter a new "beginSubrecord" (as
>> we assume that a depth of one for records-in-records is all that we will ever
>> need ;) ) or a 'null' type (as this would be the default 'root node').
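
A rough sketch of how an encoder might react to the proposed three-argument literal(...) is given below. It only illustrates the proposal: `SubrecordCollector`, the "root" key and the "_:" + value node id are hypothetical, and it assumes that each idFragment value is unique within a record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical encoder: collects literals per node, one level of subrecords only. */
final class SubrecordCollector {
    private static final String ROOT = "root";
    private final Map<String, List<String>> nodes = new LinkedHashMap<>();
    private String current = ROOT;

    SubrecordCollector() {
        nodes.put(ROOT, new ArrayList<>());
    }

    /** Proposed signature; type is null for ordinary root-node literals. */
    void literal(final String name, final String value, final String type) {
        if ("beginSubrecord".equals(type)) {
            // A new subrecord implicitly closes the previous one.
            // Assumption: the idFragment value is unique within the record.
            current = "_:" + value;
            nodes.computeIfAbsent(current, k -> new ArrayList<>());
            return;
        }
        if (type == null) {
            // A literal without a type closes any open subrecord.
            current = ROOT;
        }
        nodes.get(current).add(name + "=" + value);
    }

    Map<String, List<String>> nodes() {
        return nodes;
    }

    public static void main(String[] args) {
        SubrecordCollector collector = new SubrecordCollector();
        collector.literal("title", "Foo", null);
        collector.literal("idFragment", "12", "beginSubrecord");
        collector.literal("yourElement", "x", "subrecord");
        collector.literal("note", "z", null);
        System.out.println(collector.nodes());
    }
}
```

Encoders that do not care about subrecords could simply ignore the third argument, which is what would keep the change backward compatible.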
>> 
>> This enhancement should be backward compatible.
>> These are just ideas and they need rethinking, no doubt.
>> 
>> However, the solution you provided with the PicaItemSplitter is in my eyes not
>> very satisfactory: it is not generic and thus we cannot reuse it to
>> deal with our "blank nodes" problem.
>> 
>> We would be glad if you wanted to go in that direction. What do you think? We
>> could also contribute code if you like.
>> 
>> best wishes,
>> pascal
>> 
>> [1]http://lobid.org/organisation
>> [2]https://github.com/lobid/lodmill/commit/ae5be2e282c32ae40ea0167eafe65a8477bbbffa
>> 
>> On 11.04.2013 13:01, Böhme, Christoph wrote:
>> 
>> > Hi all,
>> >
>> > I am trying to use Metamorph to transform title records from the ZDB dataset so
>> > that the item information can be handled more easily. However, I am not sure if I
>> > can achieve this transformation with Metamorph.
>> >
>> > In the ZDB dataset each record contains title information and item information
>> > for each instance of the title in a library. The item information is described by a
>> > sequence of fields. These fields are grouped by a common suffix per item. The
>> > suffix is not unique within a record, though; fields describing items belonging to
>> > two different libraries may use the same suffix. However, these sequences are
>> > (apparently) separated by (an undocumented) field. So, in a nutshell the input
>> > records look like this:
>> >
>> > record {
>> > 	/* ... fields describing the title ... */
>> >
>> > 	101@ { a: '12' },
>> > 	201B/01 { /* literals */ },
>> > 	203@/01 { /* literals */ },
>> > 	101@ { a: '34' },
>> > 	201B/01 { /* literals */ },
>> > 	206W/01 { /* literals */  },
>> > 	201B/02 { /* literals */  },
>> > 	206W/02 { /* literals */  },
>> > 	203@/02 { /* literals */  }
>> > }
>> >
>> > Obviously, this format is difficult to work with in Metamorph. To make further
>> > processing easier, I would like to group all fields describing one item within an
>> > entity. So, the output of my transformation should look like this:
>> >
>> > record {
>> > 	/* ... literals and entities describing the title ... */
>> >
>> > 	101@ { a: '12' },
>> > 	item {
>> > 		201B { ... },
>> > 		203@ { ... }
>> > 	},
>> > 	101@ { a: '34' },
>> > 	item {
>> > 		201B { ... },
>> > 		206W { ... },
>> > 	},
>> > 	item {
>> > 		201B { ... },
>> > 		206W { ... },
>> > 		203@ { ... }
>> > 	}
>> > }
>> >
>> > So, what needs to be done is basically: open a new "item" entity (and close the
>> > previous one) every time the text after the slash changes or if an entity without a
>> > slash is encountered (the "101@" entities in the example). However, I am not sure
>> > if this can be achieved with Metamorph. Any comments are appreciated!
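
The grouping rule stated above can be sketched in plain Java, outside of Metamorph, as follows. `ItemGrouper` and the flat "start item" / "end item" event strings are hypothetical and only stand in for opening and closing an entity; field contents are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: wraps runs of fields sharing a suffix in "item" entities. */
final class ItemGrouper {

    /** Returns flat events: "start item", "end item", or a field name without its suffix. */
    static List<String> group(List<String> fieldNames) {
        List<String> events = new ArrayList<>();
        String currentSuffix = null; // null means no item entity is open
        for (String name : fieldNames) {
            int slash = name.indexOf('/');
            String suffix = slash < 0 ? null : name.substring(slash + 1);
            // Close the open item when the suffix changes or a field without a slash arrives.
            if (currentSuffix != null && !currentSuffix.equals(suffix)) {
                events.add("end item");
                currentSuffix = null;
            }
            if (suffix == null) {
                events.add(name); // fields such as 101@ stay outside any item
                continue;
            }
            if (currentSuffix == null) {
                events.add("start item");
                currentSuffix = suffix;
            }
            events.add(name.substring(0, slash));
        }
        if (currentSuffix != null) {
            events.add("end item"); // close the last item at the end of the record
        }
        return events;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList(
            "101@", "201B/01", "203@/01",
            "101@", "201B/01", "206W/01", "201B/02", "206W/02", "203@/02");
        System.out.println(group(fields));
    }
}
```

For the field sequence of the example record this yields three items, with the two 101@ fields left at record level, which matches the target structure shown above.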
>> >
>> > Cheers,
>> > Christoph
>> >
>> >
>> >
>> > _______________________________________________
>> > culturegraph mailing list
>> > culturegraph at lists.d-nb.de
>> > http://lists.d-nb.de/mailman/listinfo/culturegraph
>> >
