[Metafacture] Metafacture / Metamorph hurdles

Christoph, Pascal christoph at hbz-nrw.de
Tue Nov 11 15:18:21 CET 2014


Hi Günter,

sorry for the late answer.
You have asked a lot of questions.
Let me try to answer at least _some_ of them.

Günter Hipler wrote on 13.10.2014 18:16 :

> Dear Metafactures,
> 
> together with my colleague Nicolas Prongué from HEG Geneve we tried to 
> play around with Metafacture and Metamorph principles.
> Our first aim was to define a basic transformation from MarcXML to RDF/XML.
> 
> After getting some hints and explanations from HBZ (especially Fabian, 
> thanks a lot!) how they use the entity mechanism in Metamorph
> (https://github.com/hbz/lobid-organisations/blob/master/src/main/resources/morph-enriched.xml) 
> Nicolas was able to define a first transformation from MarcXML into RDF/XML
> https://github.com/linked-swissbib/metafacture-runner/blob/master/examples/nicolas/morph-marc21_NP.xml
> Personally I think this is really nice because Nicolas isn't experienced 
> in scripting or programming at all so far and it shows from my point of 
> view one aspect of the potential using a DSL in Metamorph.
> 
> Our questions and experiences:
> a) Instead of RDF/XML we want to serialize in turtle. Is there a 
> specialized stream-type for this?
> For RDF/XML we used nested entities and the tilde mechanism to create 
> xml - attributes
> 
>              <data source="001" name="~rdf:about">
>                  <!--The symbol "~" before rdf:about is very important: 
> it permits to integrate rdf:about as an attribute in the tag 
> rdf:Description-->
>                  <compose prefix="http://data.swissbib.ch/resource/"/>
>              </data>
> 
> 
> If I'm not wrong nesting entity elements isn't appropriate for 
> turtle-triples. My example:
> https://github.com/linked-swissbib/metafacture-runner/tree/master/examples/gh/turtletest

RDF/XML is just another serialization of RDF, so you can convert RDF/XML to
Turtle or N-Triples as you like. The simplest solution would be to post-process
the file with a tool such as "raptor".

However, you can also use Metafacture to do this.
You would pipe the XML as a string to Triples2RdfModel[1], which generates a
Jena RDF model. That model can then be piped to RdfModelFileWriter[2], which
serializes it to disk in the format you like. Note, however, that these commands
are not part of Metafacture core (and I am not sure whether they should be).
Still, you can freely combine commands if this serves your purpose.

> - Is it possible to define a generic Morph transformation which doesn't 
> depend on the output? The rdf-macros command might be helpful but I'm 
> not sure how to use it.

Not sure what you mean.

> I looked around in the HBZ lobid repository and found types like
> 
> triples-to-rdfmodel org.lobid.lodmill.Triples2RdfModel
> write-rdfmodel org.lobid.lodmill.RdfModelFileWriter
> encode-ntriples org.lobid.lodmill.PipeEncodeTriples
> which are using libraries from the org.apache.jena.* package. Based on 
> this I made some examples in our own sandbox repository which worked as 
> expected
> https://github.com/linked-swissbib/linked.swissbib.mf/tree/evaluation/examples/gh/lobid_hbz_map_triple
> 
> Is this the only way to create triple output? What were the reasons to 
> create these additional commands?

At the hbz we work with N-Triples rather than with RDF/XML. The latter seems to
be preferred by the DNB. The RDF serialization commands are more than a year old,
and there may be other mechanisms to serialize directly as N-Triples or Turtle
using Metafacture core; I don't know. For smaller conversions, or ones that do
not run daily (<20 GB ?), you may also want to post-process with e.g. raptor,
which is pretty fast at converting RDF/XML into other serialization formats.

> I have the impression the metafacture-core 'template' command might help 
> to serialize output in turtle format but I'm not sure
> 
> - Is it possible to create RDF serializations other than XML with 
> Metafacture core commands  (comparable to stream-to-xml) ?

See above.

> b) We would like to document our experiences we made so far using the 
> Metafacture framework. Our idea is to express 'our understanding' of the 
> various Metafacture/Metamorph pieces while using it in our real use 
> cases/processes.  This could be done on our project wiki and being 
> referenced e.g. on the culturegraph wiki as the central platform. Better 
> or further ideas are welcomed.

I too would like to see more real-world examples. The more, the better.

> I think this would make it a lot easier for other people to join the 
> community. At the moment the barrier to use the software for the first 
> time is really steep which could be one reason the feedback or activity 
> on the user list is so rare.
>
> Another idea to reduce the barrier:
> At the moment one can find shorter (snippet) examples in the 
> metafacture-runner repository. This is fine to get at least a first idea 
> what might be possible.
> But it would be really helpful to provide more comprehensive ('real 
> world') examples used in production workflows (which is already done by 
> HBZ in various repositories ) . Together with a documentation explaining 
> shortly the ideas behind them would be a great thing and really helpful 
> not only for newbies.
> 
> Perhaps these ideas and proposals could be discussed during the upcoming 
> Metafacture workshop at SWIB?

Yes, why not. A good opportunity, I think.

> c) Beside transformations in RDF I'm thinking about the possibilities to 
> use Metafacture/Metamorph for our Search-engine document processing. At 
> the moment we use a combination of chaining XSLT templates together with 
> various Java-plugins for specialized tasks. 
> (https://github.com/swissbib/content2SearchDocs/tree/master/xslt)
> 
> Does anybody use Metafacture/Metamorph for SOLR as the target?

I know of Jan Steinberger at GESIS. If I remember correctly, they currently also
use XSLT to feed Solr and want to use Metafacture for this. I have just let
him know that you face the same task, so that you can help each other.

> Would be 
> really nice to see how the transformations are done. I stumbled upon 
> probably really simple questions:

At the hbz we use Elasticsearch and we have some commands to index[3] and even
harvest[4] it.

> -- how to create the xml - structure for a field (at the moment we don't 
> use JSON)
> 
>   <field name="id">
>          <xsl:value-of select="$fragment/myDocID" />
> </field>
> 
> Attributes could be created with nested structures (quite complicated 
> for really large documents) but I wasn't able to create a field value 
> without any additional tag. Metafacture wants to create an additional 
> tag for the value because the data element needs a name and source 
> attribute as well.

I have no experience with this, as we don't generate XML.

> 
> -- how to create fixed simple structures like this one:
> 
> <field name="recordtype">marc</field>

> 
> My understanding of the Metamorph module: It is listening on 
> Metadata-events. But I don't have any event for such constant 
> structures. Any suggestion how to create them?

If you just want one statement or structure snippet to _always_ be included
in a record/model, you can generate it in morph when you hit a field that
_every_ record has, say the field from which you normally build the ID of the
record/statement.
E.g. in [5], when hitting the ID field of the PICA XML input, "008H.e", not only
the RDF subject is emitted but also the type of this resource
(foaf:Organization), its foaf:primaryTopicOf and so on.
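
Applied to your "recordtype" example, a minimal sketch could look like the
following. It is untested and not taken from [5]; it assumes that MARC field
001 is present in every record, and it uses the Metamorph "constant" function
to replace the value of that field with a fixed string:

```xml
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <!-- Assumption: every record has a 001 field, so this rule fires
         exactly once per record. -->
    <data source="001" name="recordtype">
      <!-- Discard the actual 001 value and emit the fixed string "marc". -->
      <constant value="marc"/>
    </data>
  </rules>
</metamorph>
```

Each record should then emit a literal named "recordtype" with the value
"marc". How that literal ends up as <field name="recordtype"> in the XML
depends on the encoder that follows the morph step, which I cannot speak to.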

> Sorry for such a long post. We would be really happy to get some feedback!

Hope I could help a bit. Looking forward to meeting you at SWIB14,
pascal

[1]https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/java/org/lobid/lodmill/Triples2RdfModel.java
[2]https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/java/org/lobid/lodmill/RdfModelFileWriter.java
[3]https://github.com/hbz/mabxml-elasticsearch/
[4]https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/hbz01ES-to-lobid.flux
[5]https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/morph_zdb-isil-file-pica2ld.xml

> Günter




