Showing posts with label data. Show all posts

Sunday, April 7, 2019

Hapaxes in Context

My last post about Hapaxes in the Ultraviolet Grasslands was well received, even though I was not satisfied with the strength of its conclusions. Here, I attempt to add context to the hapaxes through the use of some inadvisable perl scripts.

Process

Again, I'll assume some familiarity with bash. I'm also using a pair of perl scripts I wrote which probably don't generalize well, but which were fit for purpose. If I'd planned ahead, I'd have written them both as one script.

First, we'll do some very similar things to what we did last time:

$ python3-pdf2txt.py -o UVG.txt UVG.pdf
$ cat UVG.txt |
tr A-Z a-z |
sed -E "s/\s+|['‘’]s\s+|[–—-]+/\n/g" |
sed -E 's/[][<>.,();:+?!%/©&“”"#*]//g' |
sed -e "s/^['’‘]//g" |
sed -e "s/['’‘]$//g" |
grep -Ev "^[0-9d]+$" |
sort | uniq -u > UVG.hapax
$ /bin/diff -i /usr/share/dict/words UVG.hapax |
grep ">" |
cut -d " " -f2 > UVG.hapax.new

As before, line breaks have been added for clarity. Each break inside a pipeline follows a "|", so bash will keep reading the command across it; only the "$" prompts need to be dropped before pasting. Also note that this time we remove the possessive "s" from the ends of strings, and we split on hyphens as well as spaces.
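To see what the possessive and hyphen handling does, here is a minimal sketch of the same idea run on a one-line sample instead of the PDF text. The function name hapaxes is hypothetical, only ASCII quotes and hyphens are handled (the real pipeline also covers curly quotes and long dashes), and GNU sed is assumed for "\s" and "\n".

```shell
# Sketch only: a cut-down, ASCII-only variant of the pipeline above.
# "hapaxes" is a hypothetical name; GNU sed is assumed.
hapaxes() {
  tr 'A-Z' 'a-z' |
    # split on whitespace, on a possessive "'s", and on hyphens
    sed -E "s/\s+|'s\s+|-+/\n/g" |
    # drop punctuation
    sed -E 's/[][<>.,();:+?!%&"#*]//g' |
    # drop one leading and one trailing apostrophe
    sed -e "s/^'//" -e "s/'$//" |
    # drop dice notation, bare numbers, and (unlike the original) empty lines
    grep -Ev '^[0-9d]*$' |
    # keep only the words that occur exactly once
    sort | uniq -u
}

printf '%s\n' "The cat's marrow-beet stew; the cat ate 2d6 beets." | hapaxes
```

On that sample, "the" and "cat" each occur twice and "2d6" is filtered out, so what remains is ate, beet, beets, marrow, and stew, one per line.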

Next, we go back to the PDF, but we extract it as XML to access the character position data:

$ python3-pdf2txt.py -t xml -o UVG.xml UVG.pdf
$ cat UVG.xml | ./xml2tsv.pl > UVG.tsv
$ cat UVG.hapax.new | ./pdfmarker.pl UVG.tsv > UVG.pdfmark

The scripts I mentioned earlier are xml2tsv.pl and pdfmarker.pl. The former (very naively) strips extraneous markup, leaving each line with a character, four coordinates, and a page number. The latter reads that tsv file (given as an argument) and locates the coordinates of each word piped to it. (As each word only appears once, this is straightforward.) It outputs these coordinates in pdfmark format, a way to annotate PDFs.
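The perl scripts themselves are not shown here, so the following is only a rough sketch of the lookup step, written with awk rather than perl. It assumes a TSV layout of character, x0, y0, x1, y1, page (the column order is a guess), assumes each word sits on a single line of text, and emits a Highlight annotation using the /Rect, /QuadPoints, /SrcPg and /ANN keys as I understand them from the pdfmark reference. The function name mark_words is hypothetical.

```shell
# Sketch only: "mark_words" and the TSV column order are assumptions,
# not the layout produced by the real xml2tsv.pl.
# Usage: mark_words chars.tsv < words.txt
mark_words() {
  awk -F'\t' '
    NR == FNR {
      # First input: one character per line, with its box and page.
      n++
      x0[n] = $2; y0[n] = $3; x1[n] = $4; y1[n] = $5; pg[n] = $6
      c = tolower($1)
      # Remember which TSV row owns each position of the running text.
      for (k = 1; k <= length(c); k++) owner[++len] = n
      text = text c
      next
    }
    $0 == "" { next }
    {
      # Second input: one word per line. A plain substring search,
      # so word boundaries are not checked in this sketch.
      i = index(text, tolower($0))
      if (i == 0) next
      a = owner[i]; b = owner[i + length($0) - 1]
      printf "[ /Rect [%s %s %s %s] /Subtype /Highlight", x0[a], y0[a], x1[b], y1[b]
      printf " /QuadPoints [%s %s %s %s %s %s %s %s]", x0[a], y1[b], x1[b], y1[b], x0[a], y0[a], x1[b], y0[a]
      printf " /Color [1 1 0] /SrcPg %s /ANN pdfmark\n", pg[a]
    }
  ' "$1" -
}

# Tiny made-up example: the three characters of "zom" on page 1.
tsv=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' z 10 100 15 110 1 o 15 100 20 110 1 m 20 100 26 110 1 > "$tsv"
printf 'zom\n' | mark_words "$tsv"
rm -f "$tsv"
```

Words that cannot be found in the character stream are silently skipped, which is one way a word list can end up longer than the list of annotations.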

Finally, we merge the annotations back into the original PDF as highlights:

$ gs -o UVG.ann.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress UVG.pdf UVG.pdfmark

Results

This method is much pickier, in part because of the difficulty of parsing a PDF consistently. From the text file, we extracted 164 words of interest, but there are only 139 annotations in the final count. I expect that the missing 25 are words that are represented differently in the txt and xml formats. For example, if the space between two words isn't represented by a whitespace character in the xml, my script won't find a word boundary there when it looks for the word, even though the heuristics that build the text output may still correctly "add" the space back in. This method also considers each half of a hyphenated word separately, so the halves are more likely to appear multiple times or to be in the dictionary.

These numbers are also smaller than before for another reason: I have been using the free sample version, so that I can share the results. It is 78 pages, down from 158 pages in the backer version I was using before. So while we can still get a list of the output as before:

rewatch
pusca
eskatin
ashwhite
demiwarlock
vidy
engobes
tollmistress
dejus
orangeware

We can also then go find where these words are highlighted in the PDF:

In the highlighted PDF, it's easier to see that the majority of the hapaxes are proper names and normal words that my dictionary doesn't contain, like "lunchbox" and "calcinous". There are still lots of gems though, like a sign that reads No Lones to Adventerers, Frybooters or Wagonbonds, the goddess Hazmaat, and zombastodon lair. You can take a look here:

Disclaimer & Plug

I still back Luka on Patreon, and I have backed his Kickstarter as well. The Kickstarter campaign is now in its final week, and I'm very excited for it.

The free version of the PDF (available unannotated in the Kickstarter description) is licensed under a CC BY-NC-ND 4.0 license. Because all the changes I have made to it were procedural, this arguably still complies with the "NoDerivatives" part of that. But I don't actually know, so I went ahead and asked Luka, and he said this was OK anyway.

Tuesday, March 19, 2019

Hapaxes in the Ultraviolet Grasslands

At the beginning of the glossary of the Ultraviolet Grasslands (UVG), Luka asks: What have I missed? What needs more details? One way to find things that might be missing is to look for hapaxes in the work. This is not a good plan, but I tried anyway.

Process

The following stuff was done in bash. I assume some familiarity with the commands, but comment on particular decisions that I made. It could be cleaned up.

First, we need the corpus as text so that we can work with it:

> python3-pdf2txt.py -o UVG.txt UVG.pdf

Then we clean up the text, and select all the words that only appear once:

> cat UVG.txt |
tr A-Z a-z |
sed -e 's/\s/\n/g' |
sed -E 's/[][<>.,();:+?!%/©&]//g' |
sed -e "s/[‘’]/'/g" |
sed -e 's/[“”"]//g' |
sed -e 's/[–—]/-/g' |
sed -e 's/[-"'\'']$//g' |
sed -e 's/^[-"'\'']//g' |
grep -Ev "^[-0-9'd]+$" |
sort | uniq -u > UVG.hapax

Line breaks have been added for clarity. Parts of this bear closer examination:

sed -e 's/[“”"]//g' |

This could be folded into the second sed statement, but for some purposes it might be useful to keep double quotes (normalized to a single character) rather than delete them.

sed -e 's/[-"'\'']$//g' |
sed -e 's/^[-"'\'']//g' |

Quotes and hyphens at the beginning or end of a word are unlikely to carry much information, so they are stripped. This must happen after all the dash and quote characters have been "normalized".
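One detail worth noting: because these patterns are anchored with "^" or "$", each can match at most once per line even with the "g" flag, so only a single character is removed from each end. A minimal sketch of a variant that strips whole runs, assuming ASCII quotes and hyphens only (the function name trim_edges is hypothetical):

```shell
# Sketch only: "trim_edges" is a hypothetical helper.
# The "+" strips a run of leading or trailing quotes and hyphens,
# not just one character.
trim_edges() {
  sed -E -e 's/^[-"'\'']+//' -e 's/[-"'\'']+$//'
}

printf '%s\n' "--noble-pillared'\"" "'tis" "mercy-is-weakness-" | trim_edges
```

Hyphens inside a word are untouched, so the sample prints noble-pillared, tis, and mercy-is-weakness.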

Lots of the words that only appear once (6832 now) are not exciting. So we'll remove all the dictionary words:

> /bin/diff -i /usr/share/dict/words UVG.hapax |
grep ">" |
cut -d " " -f2 > UVG.hapax.new

Again, line breaks have been added for clarity. The full path to diff is specified because I've otherwise aliased diff to colordiff.
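Since diff compares the order of lines as well as their content, the result can depend on how the two files happen to be sorted. A minimal sketch of an order-independent alternative, using grep with a file of fixed-string patterns (the function name not_in_dict is hypothetical, and this is a swapped-in technique rather than what was run for this post):

```shell
# Sketch only: "not_in_dict" is a hypothetical helper.
# Usage: not_in_dict /usr/share/dict/words < UVG.hapax
not_in_dict() {
  # -F fixed strings, -x whole-line matches, -i ignore case,
  # -f read the patterns from the dictionary file, -v keep the non-matches
  grep -v -i -x -F -f "$1"
}

# Tiny made-up example.
dict=$(mktemp)
printf 'Apple\nbeet\n' > "$dict"
printf 'apple\nbeet\nzombastodon\n' | not_in_dict "$dict"
rm -f "$dict"
```

Here only zombastodon survives, since "apple" matches "Apple" case-insensitively.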

Results

Of the 1612 hapaxes now left, it might be interesting to see how the characters are distributed.

> cat UVG.hapax.new | fold -c1 | sort | uniq -c | sort -gr

This gives a table of character frequency:

3223
1647 e
1295 a
1155 i
1106 o
1097 r
1016 s
1010 n
 916 t
 877 l
 837 -
. . .
   4 3
   3  
   3 8
   2 ô
   2 ç
   2 9
   2 7
   1 Ö
   1 ñ
   1 ë
   1 â

The most common "character" is blank, and I suspect this is related to newlines (3223=2*1612-1). The other "blank" character appears to be a space that did not get stripped out initially, or which was later re-introduced. Perhaps it is some kind of other whitespace.
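If the blank rows are a nuisance, one way to sidestep them is to split the input with "grep -o ." instead, which prints only the characters it matches and so never emits an empty line. A minimal sketch (the function name char_freq is hypothetical; "-o" is a common grep extension rather than POSIX):

```shell
# Sketch only: "char_freq" is a hypothetical helper.
char_freq() {
  # one character per line, then count and sort by frequency
  grep -o . | sort | uniq -c | sort -rn
}

printf 'sub-node\nsix-lives\n' | char_freq
```

For the two sample words, "s" comes out on top with a count of 3, and no blank row appears.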

The most exciting thing in this table (I think) is the high occurrence of the hyphen: 837 hyphens across 1612 words. Some words contain more than one hyphen ("mercy-is-weakness"), so at most about half of the "hapaxes" are composite words, and these are worth considering separately. For example:

sub-node
six-lives
noble-pillared
mercy-is-weakness
marrow-beet
curse-maddened
six-limbed
force-glass
stock-piled
self-regenerating

Disregarding hyphens, these are all built from words a dictionary knows, but which Luka may be using in novel ways.

The remaining (unhyphenated) words are a mixed bag. Take this random sampling:

pyrokinetic
skalin
psionics
dustland
irshe
replicator
10x
visec
mearls
mirodar

Many of these just show the limitations of my dictionary ("pyrokinetic", "replicator"). Some of them show the limitations of the process ("10x", "jrientsblogspotcom"). Some are ad-hoc compound words ("dustland", "malicereflective"). The rest are made-up words, proper nouns, or typos, and I don't have a way to distinguish between them. It's possible that some of these were "created" by pdf2txt, which uses tunable heuristics to decide where to draw word boundaries.

If you're interested in playing with the lists, I've uploaded them here. They are split into "hyphen" and "nohyphen", and should be alphabetical.

Disclaimer & Plug

I back Luka on Patreon at the $1/mo level, which grants me access to early drafts of his projects. The version used for this project was the most recent version available to backers, but it has not been edited.

UVG is currently running a Kickstarter for a fancy printed version with editing and more art. There's a link to a free version of the manuscript there too.

Wednesday, August 7, 2013

Anniversary and Crowdfunding Analysis

I apparently missed my one-year anniversary, but the blog's been slow recently. A lot has happened in the last half-year: I've graduated, I've moved (Boston to Buffalo), I've found employment. Unfortunately, I may not be gaming as much any more, but we'll see how that plays out long term.

Just because I wasn't blogging doesn't mean I haven't been busy, and one of the things I worked on was an analysis of Kickstarter data. Because it was for a class, it assumes a certain vocabulary, has some weird stylistic artifacts, and has some persistent errors that weren't severe enough to merit fixing at the time. Eventually I would like to revisit this more completely, but until then I may as well "publish" it:
The Paper
The Handout
The Presentation
I would like to dedicate this to Erik Tenkar, whose sharp coverage of Kickstarter campaigns made me think that this might be a worthwhile project.

Looking back at this post, it's very much about myself. I can't be sure that I'll keep the blog up, but I do know that I've got at least a few more posts in me and that they'll be more gaming-related than this one.
Update: The Dropbox links are all broken now; use this link instead.