Thursday, April 11, 2019

The Journey of the Grand Myconautical Society

I made a traincar for Skerples' indefinite train project. After all that brainstorming, I didn't use any of it directly, but I got to use this illustration from Rattlemayne, which was the reward for entering the ItO pocketmod contest. Traincar linked from the image below.

Tuesday, April 9, 2019

Mad With Power in the Gardens of Ynn

The Gardens of Ynn is one of the most immediately exciting RPG books that I've read in a long time. In general, Cavegirl is a brilliant and exciting writer, and you can give her money if you'd like to support her work (outside of buying her other things, which are also brilliant). But man, that book could really use an editor, and the PDF has some weirdness. Now that I've been playing with Ghostscript, I thought I'd try my hand at solving one of the more egregious problems (IMO).

Problem

The layout of Ynn is roughly like this: there's a handful of tables at the beginning of the book, and you roll on them to generate locations. Each result on those tables is expanded upon later in the book, usually at about a page in length. The real problem is flipping around the book: the tables don't have page numbers, nor are they cross-referenced. Changing the text of the book or adding cross-references is probably too advanced for me at the moment, but I can add a table of contents, and hopefully this will make the PDF more useful at the table.

Solution

I copied and pasted the table of contents in Ynn, and lightly edited it (for example, "Chronological Abberations" is now "Shepherd of the Trees"). Then I went through and converted it to pdfmark format. Unfortunately, this process was mostly manual, apart from some find-and-replace. Finally, I ran a command that looks like:

$ gs -o Ynn.ann.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress Ynn.pdf Ynn.toc.pdfmark
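
The hand conversion could probably be scripted. Here's a minimal sketch, assuming the contents have first been saved as tab-separated "title, page" lines. Both that format and the name toc2pdfmark are made up for illustration. It emits one flat bookmark (/OUT) entry per line and escapes backslashes and parentheses, which are special inside PostScript strings:

```shell
# toc2pdfmark: hypothetical helper, not part of Ghostscript.
# Reads "title<TAB>page" lines on stdin and prints one flat
# pdfmark bookmark (/OUT) entry per input line.
toc2pdfmark() {
  # Escape backslashes and parentheses for use inside a PostScript string.
  sed -e 's/[\\()]/\\&/g' |
    awk -F'\t' '{ printf "[/Title (%s) /Page %d /OUT pdfmark\n", $1, $2 }'
}
```

Used something like this, where Ynn.toc.tsv is an assumed file name:

$ toc2pdfmark < Ynn.toc.tsv > Ynn.toc.pdfmark

Nested bookmarks would need a /Count key on the parent entries, and titles with non-ASCII characters would need extra encoding. This sketch attempts neither.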

Results

Now I get this nice sidebar on my PDF and I'm happy:

As a detail-oriented person, there are still lots of things in the book that bother me a little. But it's much more usable now.

Sunday, April 7, 2019

Hapaxes in Context

My last post about Hapaxes in the Ultraviolet Grasslands was well received, even though I was not satisfied with the strength of its conclusions. Here, I attempt to add context to the hapaxes through the use of some inadvisable perl scripts.

Process

Again, I'll assume some familiarity with bash. I'm also using a pair of perl scripts I wrote which probably don't generalize well, but which were fit for purpose. If I'd planned ahead, I'd have written them both as one script.

First, we'll do some very similar things to what we did last time:

$ python3-pdf2txt.py -o UVG.txt UVG.pdf
$ cat UVG.txt |
tr A-Z a-z |
sed -E "s/\s+|['‘’]s\s+|[–—-]+/\n/g" |
sed -E 's/[][<>.,();:+?!%/©&“”"#*]//g' |
sed -e "s/^['’‘]//g" |
sed -e "s/['’‘]$//g" |
grep -Ev "^[0-9d]+$" |
sort | uniq -u > UVG.hapax
$ /bin/diff -i /usr/share/dict/words UVG.hapax |
grep ">" |
cut -d " " -f2 > UVG.hapax.new

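As before, linebreaks have been added for clarity. Because each break comes right after a pipe, bash should carry the command over to the next line, but if you break the lines anywhere else you'll have to escape the linebreaks. Also note that this time we remove the possessive "s" from the ends of words, and we split on hyphens as well as spaces.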
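
To see what the possessive stripping and hyphen splitting do, here is a cut-down version of the pipeline wrapped in a function. It's a sketch only: the name hapaxes is made up, it handles only ASCII apostrophes and hyphens, it strips a much shorter list of punctuation, and it assumes GNU sed (for \s and for \n in the replacement).

```shell
# hapaxes: simplified, ASCII-only illustration of the pipeline above.
# Reads text on stdin and prints the words that occur exactly once.
hapaxes() {
  tr A-Z a-z |
    # Split on whitespace, on a possessive 's followed by whitespace, and on hyphens.
    sed -E "s/\s+|'s\s+|-+/\n/g" |
    # Strip a few punctuation marks (far fewer than the real pipeline).
    sed -E 's/[.,;:!?]//g' |
    # Drop any empty lines left over from the splitting.
    grep -v '^$' |
    # uniq -u keeps only lines that are not repeated.
    sort | uniq -u
}
```

For example:

$ echo "The toll-mistress's lamp lit the lamp." | hapaxes

Here "the" and "lamp" each appear twice and drop out, while "toll" and "mistress" are counted as separate words.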

Next, we go back to the PDF, but we extract it as XML to access the character position data:

$ python3-pdf2txt.py -t xml -o UVG.xml UVG.pdf
$ cat UVG.xml | ./xml2tsv.pl > UVG.tsv
$ cat UVG.hapax.new | ./pdfmarker.pl UVG.tsv > UVG.pdfmark

The scripts I mentioned earlier are xml2tsv.pl and pdfmarker.pl. The former (very naively) strips extraneous markup, leaving each line with a character, four coordinates, and a page number. The latter reads that tsv file (given as an argument) and locates the coordinates of each word piped to it. (As each word only appears once, this is straightforward.) It outputs these coordinates in pdfmark format, a way to annotate PDFs.
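
The lookup step can be sketched in awk. This is not pdfmarker.pl itself, just an illustration of the idea under some loud assumptions: the name locate_word is made up, the columns are assumed to be character, x0, y0, x1, y1, page (tab-separated, in reading order), and only ASCII words are handled. It finds a run of characters that spells the word, requires a non-letter on either side, and prints the page plus the union of the character boxes.

```shell
# locate_word: hypothetical stand-in for the lookup step, for one word.
# Usage: locate_word WORD FILE.tsv
# Assumed columns (tab-separated): char, x0, y0, x1, y1, page.
locate_word() {
  awk -F'\t' -v word="$1" '
    { ch[NR] = tolower($1); pg[NR] = $6
      x0[NR] = $2 + 0; y0[NR] = $3 + 0; x1[NR] = $4 + 0; y1[NR] = $5 + 0 }
    END {
      n = length(word)
      for (i = 1; i + n - 1 <= NR; i++) {
        # The run must spell the word and stay on one page.
        ok = 1
        for (j = 0; j < n; j++)
          if (ch[i + j] != substr(word, j + 1, 1) || pg[i + j] != pg[i]) { ok = 0; break }
        # Require a non-letter (or the edge of the file) on both sides.
        if (ok && i > 1 && ch[i - 1] ~ /^[a-z]$/) ok = 0
        if (ok && i + n <= NR && ch[i + n] ~ /^[a-z]$/) ok = 0
        if (!ok) continue
        # Take the union of the per-character boxes.
        bx0 = x0[i]; by0 = y0[i]; bx1 = x1[i]; by1 = y1[i]
        for (j = 1; j < n; j++) {
          k = i + j
          if (x0[k] < bx0) bx0 = x0[k]
          if (y0[k] < by0) by0 = y0[k]
          if (x1[k] > bx1) bx1 = x1[k]
          if (y1[k] > by1) by1 = y1[k]
        }
        printf "%s\t%d\t%.3f\t%.3f\t%.3f\t%.3f\n", word, pg[i], bx0, by0, bx1, by1
      }
    }' "$2"
}
```

A word whose neighbouring space is missing from the character data fails the boundary check and is silently skipped.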

Finally, we merge the annotations back into the original PDF as highlights:

$ gs -o UVG.ann.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress UVG.pdf UVG.pdfmark
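
For reference, a highlight in pdfmark format looks roughly like the output of the sketch below. It is not necessarily what pdfmarker.pl writes. The name bbox2pdfmark is made up, the input is assumed to be tab-separated "word, page, x0, y0, x1, y1" lines with 1-based page numbers and coordinates already in PDF units, and the yellow colour is arbitrary. The /QuadPoints order used here (top-left, top-right, bottom-left, bottom-right) is the one viewers seem to expect in practice, but treat it as an assumption.

```shell
# bbox2pdfmark: hypothetical helper, not part of Ghostscript.
# Reads "word<TAB>page<TAB>x0<TAB>y0<TAB>x1<TAB>y1" lines on stdin
# and prints one highlight annotation (/ANN) per input line.
bbox2pdfmark() {
  awk -F'\t' '{
    # /Rect is the bounding box: lower-left corner, then upper-right corner.
    printf "[ /Rect [%s %s %s %s] /Subtype /Highlight /Color [1 1 0] ", $3, $4, $5, $6
    # /QuadPoints: top-left, top-right, bottom-left, bottom-right (assumed order).
    printf "/QuadPoints [%s %s %s %s %s %s %s %s] ", $3, $6, $5, $6, $3, $4, $5, $4
    # /SrcPg names the page the annotation is placed on.
    printf "/SrcPg %d /ANN pdfmark\n", $2
  }'
}
```

The resulting file is what gets passed to gs after the original PDF, as in the command above.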

Results

This method is much pickier, in part because of the difficulty of parsing a PDF consistently. From the text file, we extracted 164 words of interest, but there are only 139 annotations in the final count. I expect that the 25 missing words are ones that are represented differently in the txt and xml output. For example, if the space between two words isn't represented by a whitespace character in the xml, my script doesn't see a word boundary there when it looks for one, even though the heuristics that build the text output may still correctly "add" the space back in. This method also considers each half of a hyphenated word separately, so the halves are more likely to appear multiple times or to be in the dictionary.

These numbers are also smaller than before for another reason: I have been using the free sample version, so that I can share the results. This is 78 pages, down from 158 pages in the backer version I was using before. So while we can still get a list of the output as before:

rewatch
pusca
eskatin
ashwhite
demiwarlock
vidy
engobes
tollmistress
dejus
orangeware

We can also then go find where these words are highlighted in the PDF:

In the highlighted PDF, it's easier to see that the majority of the hapaxes are proper names and normal words that my dictionary doesn't contain, like "lunchbox" and "calcinous". There are still lots of gems though, like a sign that reads No Lones to Adventerers, Frybooters or Wagonbonds, the goddess Hazmaat, and zombastodon lair. You can take a look here:

Disclaimer & Plug

I still back Luka on Patreon, and I have backed his Kickstarter as well. The Kickstarter campaign is now in its final week, and I'm very excited for it.

The free version of the PDF (available unannotated in the Kickstarter description) is licensed under a CC BY-NC-ND 4.0 license. Arguably, because all the changes I have made to it were procedural, this still complies with the "NoDerivatives" part of that. But I don't actually know, so I went ahead and asked Luka, and he said this was ok anyway.