Sunday, April 7, 2019

Hapaxes in Context

My last post about Hapaxes in the Ultraviolet Grasslands was well received, even though I was not satisfied with the strength of its conclusions. Here, I attempt to add context to the hapaxes with the help of some inadvisable Perl scripts.

Process

Again, I'll assume some familiarity with bash. I'm also using a pair of Perl scripts I wrote, which probably don't generalize well but were fit for purpose. If I'd planned ahead, I'd have written them both as one script.

First, we'll do some very similar things to what we did last time:

$ python3-pdf2txt.py -o UVG.txt UVG.pdf
$ cat UVG.txt |
tr A-Z a-z |
sed -E "s/\s+|['‘’]s\s+|[–—-]+/\n/g" |
sed -E 's/[][<>.,();:+?!%/©&“”"#*]//g' |
sed -e "s/^['’‘]//g" |
sed -e "s/['’‘]$//g" |
grep -Ev "^[0-9d]+$" |
sort | uniq -u > UVG.hapax
$ /bin/diff -i /usr/share/dict/words UVG.hapax |
grep ">" |
cut -d " " -f2 > UVG.hapax.new

As before, line breaks have been added for clarity. Because each break follows a pipe, bash should accept the commands as written once the "$" prompts are removed. Also note that this time we remove the possessive "s" from the ends of words, and we split on hyphens as well as spaces.
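For anyone who would rather not chain sed calls, here is a rough Python sketch of the same idea. It is not the pipeline used above, just an approximation under a few assumptions: the names tokenize and hapaxes are made up for illustration, the dictionary is passed in as a plain list of words, and a set lookup stands in for the diff against /usr/share/dict/words. It also strips every leading and trailing quote mark rather than just one.

```python
import re
from collections import Counter

# Split on a possessive 's before whitespace, on whitespace, or on hyphens/dashes.
SPLITTER = re.compile(r"['‘’]s\s+|\s+|[–—-]+")
# Punctuation deleted outright (same character set as the second sed call).
PUNCTUATION = re.compile(r'[\[\]<>.,();:+?!%/©&“”"#*]')
# Tokens made only of digits and "d" (numbers, dice such as 2d6) are skipped.
DICE_OR_NUMBER = re.compile(r"[0-9d]+")


def tokenize(text):
    """Lowercase the text and yield cleaned tokens."""
    for raw in SPLITTER.split(text.lower()):
        token = PUNCTUATION.sub("", raw).strip("'‘’")
        if token and not DICE_OR_NUMBER.fullmatch(token):
            yield token


def hapaxes(text, dictionary=()):
    """Return the sorted tokens that occur exactly once and are not in the dictionary."""
    counts = Counter(tokenize(text))
    known = {word.lower() for word in dictionary}
    return sorted(w for w, n in counts.items() if n == 1 and w not in known)


if __name__ == "__main__":
    # A made-up sentence, not a quote from the book.
    sample = "The goddess Hazmaat guards the zombastodon's lair; roll 2d6."
    print(hapaxes(sample, dictionary=["goddess", "guards", "lair", "roll"]))
    # -> ['hazmaat', 'zombastodon']
```

To run it on real input, read UVG.txt into a string and pass the lines of the system word list as the dictionary.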

Next, we go back to the PDF, but we extract it as XML to access the character position data:

$ python3-pdf2txt.py -t xml -o UVG.xml UVG.pdf
$ cat UVG.xml | ./xml2tsv.pl > UVG.tsv
$ cat UVG.hapax.new | ./pdfmarker.pl UVG.tsv > UVG.pdfmark

The scripts I mentioned earlier are xml2tsv.pl and pdfmarker.pl. The former (very naively) strips extraneous markup, leaving each line with a character, four coordinates, and a page number. The latter reads that TSV file (given as an argument) and locates the coordinates of each word piped to it. (Because each word appears only once, this is straightforward.) It outputs these coordinates in pdfmark format, a way to annotate PDFs.
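The Perl scripts themselves aren't reproduced here, but the two steps can be sketched in Python. This is a guess at the general shape, not a port. It assumes the XML has page elements with an id attribute and text elements holding one character each, with a bbox attribute of four comma-separated numbers (and no bbox on whitespace that the layout analysis inserted). The function names xml_to_rows and locate are hypothetical, and word boundaries are simplified to "anything that isn't a letter or digit".

```python
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text):
    """Yield (char, x0, y0, x1, y1, page) for every <text> element."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        page_id = int(page.get("id"))
        for text in page.iter("text"):
            char = text.text or ""
            bbox = text.get("bbox")
            if bbox is None:
                # Assumed: inserted whitespace carries no coordinates.
                yield (char, None, None, None, None, page_id)
            else:
                x0, y0, x1, y1 = (float(v) for v in bbox.split(","))
                yield (char, x0, y0, x1, y1, page_id)


def locate(rows, word):
    """Return (page, x0, y0, x1, y1) for the first run of characters spelling word."""
    sentinel = ("", None, None, None, None, None)  # flushes the final run
    run = []
    for row in list(rows) + [sentinel]:
        if row[0].isalnum():
            run.append(row)
            continue
        if run and "".join(r[0] for r in run).lower() == word:
            # Union of the character boxes gives the box for the whole word.
            return (
                run[0][5],
                min(r[1] for r in run),
                min(r[2] for r in run),
                max(r[3] for r in run),
                max(r[4] for r in run),
            )
        run = []
    return None
```

Each tuple from xml_to_rows corresponds to one line of the TSV described above. A word that never shows up as its own run of characters simply returns None.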

Finally, we merge the annotations back into the original PDF as highlights:

$ gs -o UVG.ann.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress UVG.pdf UVG.pdfmark
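The pdfmark file handed to gs is plain PostScript, one annotation per entry. As an illustration of what a single highlight entry might look like, here is a small Python formatter. It is a sketch rather than the output of pdfmarker.pl: the function name highlight_pdfmark is hypothetical, the yellow colour is an arbitrary choice, the QuadPoints ordering (top-left, top-right, bottom-left, bottom-right) is the one viewers commonly expect rather than something I have checked against every reader, and /SrcPg is assumed to be a 1-based page number.

```python
def highlight_pdfmark(page, x0, y0, x1, y1, color=(1, 1, 0)):
    """Format one highlight annotation as a pdfmark entry (assumed layout)."""

    def fmt(numbers):
        # Compact number formatting: 25.0 -> "25", 10.5 -> "10.5".
        return " ".join(f"{n:g}" for n in numbers)

    # Corner order assumed: top-left, top-right, bottom-left, bottom-right.
    quad = (x0, y1, x1, y1, x0, y0, x1, y0)
    return (
        f"[ /Rect [{fmt((x0, y0, x1, y1))}]\n"
        f"  /Subtype /Highlight\n"
        f"  /QuadPoints [{fmt(quad)}]\n"
        f"  /Color [{fmt(color)}]\n"
        f"  /SrcPg {page}\n"
        f"  /ANN pdfmark\n"
    )


if __name__ == "__main__":
    # Hypothetical coordinates, in PDF points from the bottom-left corner.
    print(highlight_pdfmark(3, 25.0, 8.0, 46.0, 22.0))
```

Concatenating one such entry per located word would give a file that can be passed to gs after the PDF, as in the command above.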

Results

This method is much pickier, in part because of the difficulty of parsing a PDF consistently. From the text file, we extracted 164 words of interest, but there are only 139 annotations in the final count. I expect the difference comes from words that are represented differently in the txt and XML formats. For example, if the space between two words isn't represented by a whitespace character in the XML, it is not detected as a word boundary when we look for it, even though the heuristics that build the text output may still correctly "add" the space back in. This method also considers each half of a hyphenated word separately, so the halves are more likely to appear multiple times or to be in the dictionary.
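The missing-space problem is easy to reproduce in miniature. The sketch below is not taken from either script. It groups a character stream into runs of letters and digits, which is roughly what a search over per-character data has to do, and shows that a word stops being findable once the whitespace element next to it is absent. The function name letter_runs and the phrase "pusca tower" are made up for the example.

```python
def letter_runs(chars):
    """Group a character stream into lowercase runs of letters and digits."""
    runs, current = [], ""
    for char in chars:
        if char.isalnum():
            current += char.lower()
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


# The same glyphs, with and without a whitespace element between the words.
with_space = list("pusca tower")
without_space = list("puscatower")

print("pusca" in letter_runs(with_space))     # True
print("pusca" in letter_runs(without_space))  # False
```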

These numbers are also smaller than before for another reason: I have been using the free sample version so that I can share the results. It is 78 pages, down from 158 pages in the backer version I was using before. We can still get a list of the output as before:

rewatch
pusca
eskatin
ashwhite
demiwarlock
vidy
engobes
tollmistress
dejus
orangeware

We can now also find where these words are highlighted in the PDF:

In the highlighted PDF, it's easier to see that the majority of the hapaxes are proper names and normal words that my dictionary doesn't contain, like "lunchbox" and "calcinous". There are still lots of gems though, like a sign that reads "No Lones to Adventerers, Frybooters or Wagonbonds", the goddess Hazmaat, and zombastodon lair. You can take a look here:

Disclaimer & Plug

I still back Luka on Patreon, and I have backed his Kickstarter as well. The Kickstarter campaign is now in its final week, and I'm very excited for it.

The free version of the PDF (available unannotated in the Kickstarter description) is licensed under a CC BY-NC-ND 4.0 license. Because all the changes I have made to it were procedural, this arguably still complies with the "NoDerivatives" part of that. But I don't actually know, so I went ahead and asked Luka, and he said this was ok anyway.

3 comments:

  1. Cool ^_^ ... and fun to rediscover the characters from the text, extracted and showcased.

  2. Very, very late to this but super interesting!

    1. I always enjoy knowing people still read the old posts! Especially let me know if you have any ideas or suggestions.
