[Tp-legal] REMINDER: Next meeting of the AG Legal on 25 June
Christof Schöch
schoech at uni-trier.de
Tue Jun 25 13:12:41 CEST 2024
Dear all,
Many thanks for the great meeting today. One more thought that I just
wanted to document so that it does not get forgotten:
Format 1 (statistical derivatives, e.g. a term-document matrix with word
frequencies) and format 2 (transformational derivatives, e.g. documents
with randomized word order) may not be as distinct as they seem,
depending on the exact examples we consider for the two cases. The
reason is that, under a few assumptions, some of them can be transformed
into each other:
- We can use a term-document matrix to generate plain text with
randomized word order.
- We can use a document with randomized word order to generate a
term-document matrix.
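
A minimal sketch of these two directions, assuming a single document
whose t-d-m column holds absolute word counts. The function names and
the sample sentence are hypothetical and only serve as illustration:

```python
import random
from collections import Counter


def shuffled_text_to_tdm_column(tokens):
    """Count each word; word order is irrelevant for the result."""
    return Counter(tokens)


def tdm_column_to_shuffled_text(counts, seed=None):
    """Expand absolute counts into a token list, then shuffle it."""
    tokens = [word for word, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(tokens)
    return tokens


# Hypothetical sample document, already tokenized.
original = "the cat sat on the mat and the dog sat too".split()

column = shuffled_text_to_tdm_column(original)
shuffled = tdm_column_to_shuffled_text(column, seed=0)

# Round trip: the shuffled text yields exactly the same counts again.
assert shuffled_text_to_tdm_column(shuffled) == column
```

Both representations carry the same information here: the multiset of
words in the document, without their order.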
A few observations:
- The relationship is not 100% symmetrical: From a document with
randomized word order, where randomization happens within a known
segment size, we can either build a term-document matrix that maintains
these segments (with a separate column for each segment) or merge the
frequencies of all segments belonging to one document into a single
column. Conversely, from a t-d-m whose segmentation was performed before
the t-d-m was calculated, we can either generate randomized texts that
respect these segment boundaries or generate one set of randomized words
for each entire document. But when a representation, in either format,
has no segmentation, we cannot reconstruct a segmentation in the other
format.
- The transformation from a t-d-m to a document with randomized word
order may not always be 100% exact: it can be exact if the t-d-m
contains absolute word frequencies; it can also be exact if the t-d-m
contains relative frequencies and we know the total number of words of
each document; but it cannot be exact if we only have relative word
frequencies and no information about the original documents' text
lengths.
- For this reason, and because segmentation cannot be recovered once a
t-d-m has been calculated or a randomization of word order has been
performed, I advocate for t-d-m representations that have some degree of
segmentation built into them and that keep absolute frequencies rather
than relative frequencies. A segmented t-d-m is the more powerful
representation (but it is also, of course, the one more amenable to
reconstruction, depending on the segment size).
- There may of course be statistical descriptions of documents that
cannot be used to generate a document with randomized word order, and
there may be transformational derivatives that are not suitable for
building a term-document matrix. But at least the t-d-m is already a
pretty standard form of text representation (e.g., the stylo package
for R ships with several in-copyright corpora in the form of t-d-m).
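
To make the points about segmentation and about absolute versus relative
frequencies concrete, here is a small sketch. It assumes one document
split into two segments, each stored as a column of absolute counts. All
names and numbers are made up for illustration:

```python
from collections import Counter
from fractions import Fraction


def merge_segments(segment_counts):
    """Sum per-segment columns into one document column (one-way step)."""
    total = Counter()
    for segment in segment_counts:
        total.update(segment)
    return total


def to_relative(counts):
    """Convert absolute counts to exact relative frequencies."""
    n_tokens = sum(counts.values())
    return {word: Fraction(c, n_tokens) for word, c in counts.items()}


def to_absolute(relative, n_tokens):
    """Recover absolute counts; requires the document length."""
    return Counter({w: int(f * n_tokens) for w, f in relative.items()})


# Hypothetical document with two segments.
segments = [
    Counter({"the": 2, "cat": 1}),
    Counter({"the": 1, "dog": 1, "sat": 2}),
]

document = merge_segments(segments)
relative = to_relative(document)

# With the known length, relative frequencies give back exact counts.
assert to_absolute(relative, sum(document.values())) == document

# Without the length, a document twice as long looks identical.
assert to_relative(document + document) == relative
```

Merging the segments is easy, but nothing in the merged column says
which occurrences came from which segment, so the step cannot be undone.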
Best wishes,
Christof
On 18.06.24 15:27, Genêt, Philippe wrote:
>
> Dear colleagues,
>
> the next meeting of the AG Legal will take place next *Tuesday,
> 25 June 2024, at 11 a.m.* in this Zoom room:
> https://zoom.us/j/93357206007?pwd=SVVDNFFyTTJkYmp3cGlKeElTS3JqUT09
>
> The agenda <https://textplus.sync.academiccloud.de/f/780031> is the
> same as last time: the next deliverable, whose deadline has now moved
> even closer.
>
> As always, feel free to add any points you would still like to discuss!
>
> Until then, best regards
>
> Philippe
>
> --
> Philippe Genêt
>
> Coordinator DNB at Text+
>
>
> Deutsche Nationalbibliothek
> Fachbereich Informationsinfrastruktur
> Adickesallee 1
> 60322 Frankfurt am Main
>
> Phone: +49 69 1525-1847
>
> E-mail: p.genet at dnb.de
>
> text-plus.org <http://www.text-plus.org/>
>
> dnb.de <http://www.dnb.de/>
>
>
--
Prof. Dr. Christof Schöch
Professor for Digital Humanities, FB II
Co-Director, Trier Center for Digital Humanities
Trier University, Germany
https://dh.uni-trier.de
https://tcdh.uni-trier.de