[Tp-legal] REMINDER: Next meeting of the AG Legal on 25 June

Tue Jun 25 13:12:41 CEST 2024

Dear all,

Many thanks for the great meeting today. One more thought that I just 
wanted to document / not forget:

Format 1 (statistical derivatives, e.g. a term-document matrix with word
frequencies) and format 2 (transformational derivatives, e.g. documents
with randomized word order) may not be as distinct as they seem,
depending on the exact examples we consider for the two cases. The
reason is that some of them can be transformed into each other, given a
few assumptions.

- We can use a term-document matrix to generate plain text with
randomized word order.
- We can use a document with randomized word order to generate a
term-document matrix.
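
The two conversions above can be sketched in Python. This is only a
minimal illustration under my own assumptions: one document, whitespace
tokenization, absolute frequencies, and hypothetical function names that
do not come from any particular tool.

```python
import random
from collections import Counter


def tdm_column_from_text(tokens):
    """Count absolute word frequencies for one document (one t-d-m column)."""
    return Counter(tokens)


def randomized_text_from_column(column, seed=0):
    """Expand absolute frequencies back into tokens, then shuffle them."""
    tokens = list(column.elements())
    random.Random(seed).shuffle(tokens)
    return tokens


# Hypothetical toy document, used for illustration only.
original = "the cat sat on the mat and the dog sat too".split()

# Format 2: a document with randomized word order.
shuffled = original[:]
random.Random(1).shuffle(shuffled)

# Format 2 -> format 1: word order is irrelevant for the counts.
column = tdm_column_from_text(shuffled)

# Format 1 -> format 2: a new randomized text with the same word counts.
regenerated = randomized_text_from_column(column)
```

Both directions preserve the same information here: the multiset of
words in the document, without their order.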

A few observations:

- The relationship is not 100% symmetrical: From a document with
randomized word order, where randomization happens within a certain,
known segment size, we can either build a term-document matrix that
maintains these segments (with a separate column for each segment) or
merge the frequencies for all segments belonging to one document
into a single column. Conversely, from a t-d-m with a given segmentation
performed before calculating the t-d-m, we can generate randomized texts
respecting these segment boundaries or generate one set of randomized
words for each entire document. But when there is no segmentation in
the format we start from, whichever it is, we cannot reconstruct one in
the other format.
- The transformation from t-d-m to randomized word order document may 
not always be 100% exact: it can be exact if the t-d-m contains absolute 
word frequencies; it can also be exact if the t-d-m contains relative 
frequencies and we know the total number of words of each document; but 
it cannot be exact if we only have relative word frequencies and no 
information about the original documents' text lengths.
- For this reason, and because segmentation cannot be recovered once a 
t-d-m has been calculated or a randomization of word order has been 
performed, I advocate for t-d-m representations that have some degree of 
segmentation built into them; that is the more powerful representation 
(but it is also, of course, the one more amenable to reconstruction, 
depending on the segment size); and for keeping absolute frequencies 
rather than calculating relative frequencies.
- There may of course be statistical descriptions of documents that 
cannot be used to generate a document with randomized word order; or 
there might be transformational derivatives that are not suitable for 
building a term-document matrix. But at least the t-d-m is a pretty
standard form of text representation already (e.g., the stylo() package
for R ships with several in-copyright corpora in the form of t-d-m).
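
The observations on segmentation and on absolute versus relative
frequencies can be illustrated with a small Python sketch. The function
names, the segment size of 4 and the toy token list are my own
assumptions for illustration, not an existing implementation.

```python
from collections import Counter


def segment_columns(tokens, segment_size):
    """Build one column of absolute counts per consecutive segment."""
    return [
        Counter(tokens[i:i + segment_size])
        for i in range(0, len(tokens), segment_size)
    ]


def merge_columns(columns):
    """Sum per-segment counts into a single column for the whole document."""
    total = Counter()
    for column in columns:
        total.update(column)
    return total


def to_relative(column):
    """Convert absolute counts into relative frequencies."""
    n = sum(column.values())
    return {word: count / n for word, count in column.items()}


def to_absolute(relative, total_words):
    """Recover absolute counts; requires the document length to be known."""
    return Counter(
        {word: round(freq * total_words) for word, freq in relative.items()}
    )


# Hypothetical toy document with 12 tokens.
tokens = "a b a c b a d a b c a d".split()

segments = segment_columns(tokens, 4)   # segmented t-d-m: 3 columns
merged = merge_columns(segments)        # merging segments is always possible

relative = to_relative(merged)
recovered = to_absolute(relative, len(tokens))  # exact if length is known

# A document twice as long has the same relative frequencies, so the
# length cannot be inferred from relative frequencies alone.
doubled = to_relative(Counter(tokens * 2))
```

Merging segment columns loses the segment boundaries, and `merged`
alone gives no way to get `segments` back, which is why the segmented,
absolute-frequency form is the more powerful starting point.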

Best wishes,
Christof

On 18.06.24 15:27, Genêt, Philippe wrote:
>
> Dear colleagues,
>
> the next meeting of the AG Legal will take place this coming *Tuesday,
> 25 June 2024, at 11:00* in this Zoom room:
> https://zoom.us/j/93357206007?pwd=SVVDNFFyTTJkYmp3cGlKeElTS3JqUT09
>
> The agenda <https://textplus.sync.academiccloud.de/f/780031> is the
> same as last time: the next deliverable, whose deadline has now moved
> even closer.
>
> As always, feel free to add any points you would still like to discuss!
>
> Until then, best regards
>
> Philippe
>
> --
> Philippe Genêt
>
> Coordinator DNB at Text+
>
>
> Deutsche Nationalbibliothek
> Fachbereich Informationsinfrastruktur
> Adickesallee 1
> 60322 Frankfurt am Main
>
> Phone: +49 69 1525-1847
>
> Email: p.genet at dnb.de <mailto:p.genet at dnb.de>
>
> text-plus.org <http://www.text-plus.org/>
>
> dnb.de <http://www.dnb.de/>
>
>
-- 

*  Prof. Dr. Christof Schöch*
   Professor for Digital Humanities, FB II
   Co-Director, Trier Center for Digital Humanities
Trier University, Germany
https://dh.uni-trier.de
https://tcdh.uni-trier.de