The Benign Brown Beast: Hapaxes in the Ultraviolet Grasslands

At the beginning of the glossary of the Ultraviolet Grasslands (UVG), Luka asks: What have I missed? What needs more details? One way to find things that might be missing is to look for hapaxes, words that appear only once, in the work. This is not a good plan, but I tried anyway.

Process

Everything below was done in bash. I assume some familiarity with the commands, but I comment on particular decisions that I made. It could all be cleaned up.

First, we need the corpus as text so that we can work with it:

> python3-pdf2txt.py -o UVG.txt UVG.pdf

Then we clean up the text, and select all the words that only appear once:


> cat UVG.txt |
tr A-Z a-z |
sed -e 's/\s/\n/g' |
sed -E 's/[][<>.,();:+?!%/©&]//g' |
sed -e "s/[‘’]/'/g" |
sed -e 's/[“”"]//g' |
sed -e 's/[–—]/-/g' |
sed -e 's/[-"'\'']$//g' |
sed -e 's/^[-"'\'']//g' |
grep -Ev "^[-0-9'd]+$" |
sort | uniq -u > UVG.hapax

Line breaks have been added for clarity. Parts of this bear closer examination:

sed -e 's/[“”"]//g' |

This could be folded into the second sed statement (the one that strips punctuation), but for some purposes it might be useful to keep double quotes and merely normalize them.
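
As a rough sketch of that "keep but normalize" option: the helper below maps curly double quotes to plain ones instead of deleting them. The name normalize_dquotes is made up for illustration and is not part of the pipeline above. Each curly quote gets its own substitution rather than a bracket expression, so the result should be the same in the C locale and in UTF-8 locales.

```shell
# Hypothetical helper: map curly double quotes to plain ASCII ones
# instead of deleting them.
normalize_dquotes() {
  sed -e 's/“/"/g' -e 's/”/"/g'
}

# Toy input, not taken from UVG.
printf '%s\n' 'the “benign” beast' | normalize_dquotes
```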


sed -e 's/[-"'\'']$//g' |
sed -e 's/^[-"'\'']//g' |

Quotes and hyphens at the beginning or end of a word are unlikely to carry much information, so they are stripped. This must happen after all the dash and quote characters have been "normalized".
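
One detail worth noting: each of these two commands removes a single character per word, so a word that starts or ends with a doubled dash or quote keeps one of them. If the intent is to strip the whole run, a variant along these lines might do it. This is a sketch, and strip_edges is a made-up name:

```shell
# Hypothetical helper: strip any run of leading or trailing hyphens,
# apostrophes and double quotes from each line (one word per line).
strip_edges() {
  sed -E "s/^[-\"']+//; s/[-\"']+\$//"
}

# Toy words, not taken from UVG. Punctuation inside a word is kept.
printf '%s\n' "'tis" "well-being-" "--aside--" "rock'n'roll" | strip_edges
```

As in the original, this only makes sense after the curly quotes and long dashes have been turned into their plain equivalents.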

Many of the words that appear only once (6832 at this point) are not exciting, so we'll remove all the dictionary words:


> /bin/diff -i /usr/share/dict/words UVG.hapax |
grep ">" |
cut -d " " -f2 > UVG.hapax.new

Again, line breaks have been added for clarity. The full path to diff is specified because I've otherwise aliased diff to colordiff.
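
A caveat about this step: diff compares the two files as sequences, so the result can depend on both being sorted in the same order, and cut keeps only the first space-separated field of each added line. A hedged alternative is to let grep do the filtering, which does not care about order. The flags used (-v, -i, -x, -F, -f) are standard grep options; the function name and the toy word lists are made up for illustration.

```shell
# Hypothetical helper: print the lines of the word list ($2) that do not
# match any dictionary entry ($1) exactly, ignoring case.
non_dictionary() {
  grep -v -i -x -F -f "$1" "$2"
}

# Toy data standing in for /usr/share/dict/words and UVG.hapax.
dict=$(mktemp)
words=$(mktemp)
printf '%s\n' Apple beet zebra > "$dict"
printf '%s\n' zebra irshe apple skalin > "$words"

# Neither file needs to be sorted for this to work.
non_dictionary "$dict" "$words"
rm -f "$dict" "$words"
```

Here -F treats each dictionary entry as a fixed string and -x requires it to match a whole line, so "beet" in the dictionary would not hide "marrow-beet".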

Results

With 1612 hapaxes now left, it might be interesting to see how the characters are distributed.

> cat UVG.hapax.new | fold -c1 | sort | uniq -c | sort -gr

This gives a table of character frequency:

3223
1647	e
1295	a
1155	i
1106	o
1097	r
1016	s
1010	n
916	t
877	l
837	-

. . .

4	3
3
3	8
2	ô
2	ç
2	9
2	7
1	Ö
1	ñ
1	ë
1	â

The most common "character" is blank, and I suspect these are newlines: 3223 = 2*1612 - 1, which is almost exactly two per hapax. The other "blank" character (3 occurrences) appears to be a space that either did not get stripped out initially or was re-introduced later. Perhaps it is some other kind of whitespace.
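
To keep line endings out of the count altogether, the characters could be pulled out with grep -o, since "." never matches a newline. A sketch, assuming a grep that supports -o (GNU and BSD grep both do); char_freq is a made-up name, and multi-byte characters such as ô are only kept whole in a UTF-8 locale:

```shell
# Hypothetical helper: count how often each character occurs on stdin,
# most frequent first. Newlines are never counted.
char_freq() {
  grep -o . | sort | uniq -c | sort -rn
}

# Two of the hapaxes quoted in this post, used as toy input.
printf '%s\n' sub-node six-lives | char_freq
```

Any remaining blank entry in such a table would then have to be a real whitespace character inside a word.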

The most exciting thing in this table (I think) is the high occurrence of the hyphen. With 837 hyphens across 1612 words, up to roughly half of the "hapaxes" are likely composite words (fewer if many contain more than one hyphen, as "mercy-is-weakness" does), and worth considering separately. For example:

sub-node six-lives noble-pillared mercy-is-weakness marrow-beet curse-maddened six-limbed force-glass stock-piled self-regenerating

Disregarding hyphens, these are all built from words a dictionary knows, but which Luka may be combining in novel ways.
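
Whether every part of a hyphenated word really is a dictionary word could be checked mechanically. A minimal sketch, in which the function name, the tiny dictionary, and the word "six-skalin" are all invented for illustration:

```shell
# Hypothetical helper: split a hyphenated word ($2) at its hyphens and
# print the parts that are not in the dictionary file ($1).
unknown_parts() {
  printf '%s\n' "$2" | tr -s -- '-' '\n' | grep -v -i -x -F -f "$1"
}

# Toy dictionary standing in for /usr/share/dict/words.
dict=$(mktemp)
printf '%s\n' marrow beet six > "$dict"

# Prints each word followed by its unknown parts, if any.
for w in marrow-beet six-skalin; do
  printf '%s: %s\n' "$w" "$(unknown_parts "$dict" "$w" | tr '\n' ' ')"
done
rm -f "$dict"
```

A word whose line comes back empty is made up entirely of known parts.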

The remaining (unhyphenated) words are a mixed bag. Take this random sampling:

pyrokinetic skalin psionics dustland irshe replicator 10x visec mearls mirodar

Many of these just show the limitations of my dictionary ("pyrokinetic", "replicator"). Some of them show the limitations of the process ("10x", "jrientsblogspotcom"). Some are ad-hoc compound words ("dustland", "malicereflective"). The rest are made-up words, proper nouns, or typos, and I don't have a way to distinguish between them. It's possible that some of these were "created" by pdf2txt, which uses tunable heuristics to decide where to draw word boundaries.

If you're interested in playing with the lists, I've uploaded them here. They are split into "hyphen" and "nohyphen", and should be alphabetical.
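
For anyone wanting to redo the split on their own list, something like the following would work. This is a sketch: the function name and the ".hyphen" and ".nohyphen" suffixes are assumptions, not necessarily how the uploaded files were produced.

```shell
# Hypothetical helper: split a word list ($1) into "$2.hyphen" and
# "$2.nohyphen"; sort keeps each output alphabetical.
split_by_hyphen() {
  grep -e '-' "$1" | sort > "$2.hyphen"
  grep -v -e '-' "$1" | sort > "$2.nohyphen"
}

# Toy list built from words quoted in this post.
list=$(mktemp)
printf '%s\n' sub-node skalin force-glass irshe > "$list"
split_by_hyphen "$list" "$list"
wc -l < "$list.hyphen"      # number of hyphenated words
rm -f "$list" "$list.hyphen" "$list.nohyphen"
```

Counting the lines of the hyphen file also gives the actual number of hyphenated hapaxes, rather than an estimate from the character table.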

Disclaimer & Plug

I back Luka on Patreon at the $1/mo level, which grants me access to early drafts of his projects. The version used for this project was the most recent version available to backers, but it has not been edited.

UVG is currently running a Kickstarter for a fancy printed version with editing and more art. There's a link to a free version of the manuscript there too.

Tuesday, March 19, 2019
