<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Dear all, <br>
</p>
<p>Many thanks for the great meeting today. One more thought that I
wanted to document so that it does not get lost: <br>
</p>
<p>Format 1 (statistical derivatives, e.g. a term-document matrix
with word frequencies) and format 2 (transformational derivatives,
e.g. documents with randomized word order) may not be as distinct
as they seem, depending on the exact examples we consider for the
two cases. The reason is that, given a few assumptions, some of
them can be transformed into each other. <br>
</p>
<p>- We can use a term-document matrix to generate plain text with
randomized word order; <br>
- We can use a document with randomized word order to generate a
term-document matrix. <br>
</p>
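<p>As a minimal sketch of these two directions, assuming absolute
frequencies and whitespace-separated tokens (the function names and
the example column are hypothetical illustrations, not part of any
existing tool): <br>
</p>
<pre>
```python
import random
from collections import Counter


def counts_to_shuffled_text(counts, seed=None):
    """Expand absolute word counts into a text with random word order."""
    words = [word for word, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(words)
    return " ".join(words)


def shuffled_text_to_counts(text):
    """Count whitespace-separated tokens; word order plays no role."""
    return Counter(text.split())


# One column of a hypothetical term-document matrix (absolute frequencies).
column = {"the": 3, "cat": 2, "sat": 1}
shuffled = counts_to_shuffled_text(column, seed=42)

# The round trip gives back exactly the same frequencies.
assert shuffled_text_to_counts(shuffled) == Counter(column)
```
</pre>
<p>The round trip is lossless here only because the column holds
absolute counts; the original word order is of course not recovered
in either direction. <br>
</p>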
<p>A few observations: <br>
</p>
<p>- The relationship is not 100% symmetrical: From a document with
randomized word order, where randomization happens within
segments of a known size, we can either build a
term-document matrix that maintains these segments (with a separate
column for each segment) or merge the frequencies of all
segments belonging to one document into a single column.
Conversely, from a t-d-m calculated on texts that were segmented
beforehand, we can generate randomized texts
that respect these segment boundaries or generate one set of
randomized words for each entire document. But when either format
lacks segmentation, we cannot reconstruct the segmentation in the
other format. <br>
- The transformation from a t-d-m to a document with randomized word
order may not always be 100% exact: it can be exact if the t-d-m
contains absolute word frequencies; it can also be exact if the
t-d-m contains relative frequencies and we know the total number
of words in each document; but it cannot be exact if we only have
relative word frequencies and no information about the original
documents' text lengths. <br>
- For this reason, and because segmentation cannot be recovered
once a t-d-m has been calculated or word order has been
randomized, I advocate for t-d-m representations that have
some degree of segmentation built into them, as this is the more
powerful representation (though it is also, of course, the one more
amenable to reconstruction, depending on the segment size), and
for keeping absolute frequencies rather than calculating relative
frequencies. <br>
- There may of course be statistical descriptions of documents
that cannot be used to generate a document with randomized word
order, and there may be transformational derivatives that are not
suitable for building a term-document matrix. But at least the
t-d-m is already a fairly standard form of text representation
(e.g., the stylo package for R ships with several in-copyright
corpora in the form of t-d-ms). <br>
<br>
</p>
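<p>The points about segmentation and about relative frequencies can
be sketched as follows. This is only an illustration under stated
assumptions: each segment is one column of absolute counts, and the
helper names and example data are hypothetical. <br>
</p>
<pre>
```python
from collections import Counter


def merge_segments(segment_columns):
    """Sum per-segment counts into one document-level column.

    The reverse step is not possible: a merged column no longer
    records which segment each occurrence came from.
    """
    total = Counter()
    for column in segment_columns:
        total.update(column)
    return total


def absolute_from_relative(rel_freqs, doc_length):
    """Recover absolute counts from relative frequencies and a known length."""
    return {word: round(freq * doc_length) for word, freq in rel_freqs.items()}


# Two hypothetical segments of one document (absolute frequencies).
segments = [Counter({"the": 2, "cat": 1}), Counter({"the": 1, "sat": 1})]
document = merge_segments(segments)

# Relative frequencies alone do not determine the counts; together
# with the document length (5 tokens here) they do.
length = sum(document.values())
relative = {word: n / length for word, n in document.items()}
assert absolute_from_relative(relative, length) == dict(document)
```
</pre>
<p>The same relative frequencies are compatible with documents of
different lengths, which is why the text length has to be stored
alongside them if absolute counts are to be recovered. <br>
</p>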
Best wishes,<br>
Christof <br>
<p><br>
</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 18.06.24 15:27, Genêt, Philippe
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:414dd2556aa643aeb37efcfb91905195@dnb.de">
<div class="WordSection1">
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif">Dear
colleagues,<o:p></o:p></span></p>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"> <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif">The
next
<span style="color:#44546A">meeting of the AG Legal </span>will
take place this coming
<b>Tuesday, 2<span style="color:#44546A">5</span> <span
style="color:#44546A">
June </span>2024, at 11 a.m.</b> in this Zoom room:
<a
href="https://zoom.us/j/93357206007?pwd=SVVDNFFyTTJkYmp3cGlKeElTS3JqUT09"
moz-do-not-send="true">
<span style="color:#0563C1">https://zoom.us/j/93357206007?pwd=SVVDNFFyTTJkYmp3cGlKeElTS3JqUT09</span></a>
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"> <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:black">The
<a href="https://textplus.sync.academiccloud.de/f/780031"
moz-do-not-send="true">agenda</a></span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif">
<span style="color:black">is the same as last time:
</span>the next deliverable, whose deadline
<span style="color:black">has now drawn even closer</span>.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif"> </span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif">As
always, feel free to add any items you would still like to
discuss!<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"> <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif">Until
then, best regards<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif">Philippe<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"> <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif">--<br>
</span><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">Philippe
Genêt</span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">Coordinator
DNB@Text+</span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div style="margin-bottom:12.0pt">
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif"><br>
</span><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">Deutsche
Nationalbibliothek</span><span style="font-size:9.0pt">
<br>
</span><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">Fachbereich
Informationsinfrastruktur<br>
Adickesallee 1</span><span style="font-size:9.0pt"> <br>
</span><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">60322
Frankfurt am Main</span><span style="font-size:9.0pt">
</span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">Phone:
+49 69 1525-1847</span><span style="font-size:9.0pt">
</span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">E-Mail:
<a href="mailto:p.genet@dnb.de" moz-do-not-send="true"><span
style="color:#0563C1">p.genet@dnb.de</span></a>
</span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif"> </span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif"><a
href="http://www.text-plus.org/" moz-do-not-send="true"><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif;color:#0563C1">text-plus.org</span></a></span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div style="margin-bottom:12.0pt">
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif"><a
href="http://www.dnb.de/" moz-do-not-send="true"><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif;color:#0563C1">dnb.de</span></a></span><span
style="font-size:9.0pt;font-family:"Verdana",sans-serif">
</span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif"> </span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif"> </span><span
style="font-size:10.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></p>
</div>
</div>
<br>
<fieldset class="moz-mime-attachment-header"></fieldset>
</blockquote>
<div class="moz-signature">-- <br>
<div class="moz-signature">
<div class="moz-signature"> <small>
<p><b> Prof. Dr. Christof Schöch</b> <br>
Professor for Digital Humanities, FB II <br>
Co-Director, Trier Center for Digital Humanities <br>
<img moz-do-not-send="false"
src="cid:part1.gQwXLIeQ.NPU5RXS9@uni-trier.de"
alt="Trier University, Germany" width="190"> <br>
<a href="https://dh.uni-trier.de"
class="moz-txt-link-freetext">https://dh.uni-trier.de</a>
<br>
<a href="https://tcdh.uni-trier.de"
class="moz-txt-link-freetext">https://tcdh.uni-trier.de</a></p>
</small> </div>
</div>
</div>
</body>
</html>