At the beginning of the glossary of the Ultraviolet Grasslands (UVG), Luka asks: What have I missed? What needs more details?
One way to find things that might be missing is to look for hapaxes in the work, that is, words that appear only once. This is not a good plan, but I tried it anyway.
Process
Everything below was done in bash. I assume some familiarity with the commands, but I comment on particular decisions that I made. It could be cleaned up.
First, we need the corpus as text so that we can work with it:
> python3-pdf2txt.py -o UVG.txt UVG.pdf
Then we clean up the text, and select all the words that only appear once:
> cat UVG.txt |
tr A-Z a-z |
sed -e 's/\s/\n/g' |
sed -E 's/[][<>.,();:+?!%/©&]//g' |
sed -e "s/[‘’]/'/g" |
sed -e 's/[“”"]//g' |
sed -e 's/[–—]/-/g' |
sed -e 's/[-"'\'']$//g' |
sed -e 's/^[-"'\'']//g' |
grep -Ev "^[-0-9'd]+$" |
sort | uniq -u > UVG.hapax
Line breaks have been added for clarity. Parts of this bear closer examination:
sed -e 's/[“”"]//g' |
This could be folded into the second sed statement, but for some purposes it might be useful to keep double quotes and merely normalize them.
sed -e 's/[-"'\'']$//g' |
sed -e 's/^[-"'\'']//g' |
Quotes and hyphens at the beginning or end of a word are unlikely to carry much information, so they are stripped. This must happen after all the dash and quote characters have been "normalized".
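To see what the cleanup does to individual tokens, here is a minimal sketch of the same idea wrapped in a function and run on a made-up sentence. The function name `hapaxes` and the sample text are invented for illustration. It only handles ASCII quotes and hyphens, and unlike the pipeline above it also drops empty lines, so treat it as a simplification and not a replacement.

```shell
# hapaxes: read text on stdin, print each word that occurs exactly once.
# Hypothetical helper; an ASCII-only simplification of the pipeline above.
hapaxes() {
    tr 'A-Z' 'a-z' |                         # lowercase everything
    tr -s '[:space:]' '\n' |                 # one token per line
    sed -e 's|[][<>.,();:+?!%/&"]||g' |      # delete punctuation anywhere in the token
    sed -e "s/^[-']*//" -e "s/[-']*\$//" |   # strip leading/trailing hyphens and quotes
    grep -Ev "^[-0-9'd]*$" |                 # drop blanks, numbers, and dice notation like 2d6
    sort | uniq -u                           # keep only lines seen exactly once
}

# "the" appears twice and "2d6" looks like dice, so neither is printed.
printf '%s\n' "Roll 2d6 for the 'six-limbed' beast - the beast's mercy-is-weakness." | hapaxes
```

The internal apostrophe in "beast's" and the internal hyphens in "six-limbed" survive because the stripping is anchored to the start and end of each token.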
Lots of the words that only appear once (6832 now) are not exciting. So we'll remove all the dictionary words:
> /bin/diff -i /usr/share/dict/words UVG.hapax |
grep ">" |
cut -d " " -f2 > UVG.hapax.new
Again, line breaks have been added for clarity. The full path to diff is specified because I've otherwise aliased diff to colordiff.
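As far as I can tell, the diff trick only behaves like a set difference when both files are sorted the same way, and the `>` prefix has to be cut off afterwards. A possible alternative, sketched here and not the command used above, is to let grep exclude whole-line, fixed-string, case-insensitive matches. The function name `not_in_dict` and the toy word lists are made up for illustration.

```shell
# not_in_dict DICT: print stdin lines that are not a whole-line match for any word in DICT.
# -v invert, -i ignore case, -x match whole lines, -F fixed strings, -f read patterns from a file.
not_in_dict() {
    grep -vixFf "$1"
}

# Toy dictionary standing in for /usr/share/dict/words.
dict=$(mktemp)
printf '%s\n' Replicator marrow beet > "$dict"

# "marrow" and "replicator" are in the toy dictionary, so only the other two are printed.
printf '%s\n' dustland marrow replicator visec | not_in_dict "$dict"
rm -f "$dict"
```

On the real files this would look something like `not_in_dict /usr/share/dict/words < UVG.hapax`. Because matching is by whole line, the order of either file does not matter.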
Results
Of the 1612 hapaxes now left, it might be interesting to see how the characters are distributed.
> cat UVG.hapax.new | fold -c1 | sort | uniq -c | sort -gr
This gives a table of character frequency:
3223 | |
1647 | e |
1295 | a |
1155 | i |
1106 | o |
1097 | r |
1016 | s |
1010 | n |
916 | t |
877 | l |
837 | - |
4 | 3 |
3 | |
3 | 8 |
2 | ô |
2 | ç |
2 | 9 |
2 | 7 |
1 | Ö |
1 | ñ |
1 | ë |
1 | â |
The most common "character" is blank, and I suspect this is related to newlines (3223=2*1612-1). The other "blank" character appears to be a space that did not get stripped out initially, or which was later re-introduced. Perhaps it is some kind of other whitespace.
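If the blank rows are unwanted, one option is to tally only non-whitespace characters. This is a sketch that assumes GNU grep and a UTF-8 locale (so that accented letters count as one character each); `char_freq` is a made-up name.

```shell
# char_freq: count non-whitespace characters on stdin, most frequent first.
char_freq() {
    grep -o '[^[:space:]]' |   # one non-whitespace character per output line
    sort | uniq -c |           # count identical characters
    sort -k1,1nr -k2           # by count descending, ties broken by character
}

printf '%s\n' 'six-limbed' 'self-made' | char_freq
```

To find out what the stray blank actually is, piping the offending lines through `od -c` shows the raw bytes.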
The most exciting thing in this table (I think) is the high occurrence of the hyphen. This means that roughly half of the "hapaxes" are likely composite words, and worth considering separately. For example:
sub-node
six-lives
noble-pillared
mercy-is-weakness
marrow-beet
curse-maddened
six-limbed
force-glass
stock-piled
self-regenerating
Disregarding hyphens, these are all words a dictionary knows, but which Luka may be using in novel ways.
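Since one word can contain several hyphens ("mercy-is-weakness" has two), the hyphen count in the table is an upper bound on the number of hyphenated words. A rough sketch of how to count and separate them directly; the function names and the sample list are hypothetical.

```shell
# hyphenated / unhyphenated: filter a word list on stdin by whether a line contains a hyphen.
# The "--" stops grep from reading the pattern "-" as an option.
hyphenated()   { grep -- '-'; }
unhyphenated() { grep -v -- '-'; }

words='mercy-is-weakness
dustland
six-limbed
visec'

printf '%s\n' "$words" | hyphenated | wc -l   # number of hyphenated words
printf '%s\n' "$words" | unhyphenated         # the remaining words
```

Redirecting each function's output to its own file gives two lists that keep the order of the input.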
The remaining (unhyphenated) words are a mixed bag. Take this random sampling:
pyrokinetic
skalin
psionics
dustland
irshe
replicator
10x
visec
mearls
mirodar
Many of these just show the limitations of my dictionary ("pyrokinetic", "replicator"). Some of them show the limitations of the process ("10x", "jrientsblogspotcom"). Some are ad-hoc compound words ("dustland", "malicereflective"). The rest are either made-up, proper nouns, or typos, and I don't have a way to distinguish between them. It's possible that some of these were "created" by pdf2txt, which uses tunable heuristics to decide where to draw word boundaries.
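One way to draw a random sample like this is `shuf` from GNU coreutils. A minimal sketch, with `sample_words` as a hypothetical wrapper name:

```shell
# sample_words N: print N lines chosen at random (without repeats) from stdin.
sample_words() {
    shuf -n "$1"
}

# Output differs from run to run.
printf '%s\n' psionics dustland irshe replicator visec mearls mirodar | sample_words 3
```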
If you're interested in playing with the lists, I've uploaded them here. They are split into "hyphen" and "nohyphen", and should be alphabetical.
Disclaimer & Plug
I back Luka on Patreon at the $1/mo level, which grants me access to early drafts of his projects. The version used for this project was the most recent version available to backers, but it has not been edited.
UVG is currently running a Kickstarter for a fancy printed version with editing and more art. There's a link to a free version of the manuscript there too.