I’m on the Job Market!
I’m an information scientist with a digital humanities background, specializing in large-scale text analysis, crowd systems, and information retrieval over novel datasets. Look at my CV, or contact me...
Your First Twitter Bot, in 20 minutes
"I think it was the Pres. at dawn with the Spin Back Knuckle." (bad Clue guesses, @BadClues, September 6, 2015)
Creating a Twitter bot is a great exercise for formalizing a simple concept in a concrete...
MARC Fields in the HathiTrust
At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language...
Git tip: Automatically converting iPython notebook READMEs to Markdown
A small but useful tip today, on using iPython notebooks for a git project README while keeping an auto-generated version in the Markdown format that GitHub prefers. I’m in the midst of refreshing and...
HTRC Feature Reader 2.0
I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for...
Term Weighting for Humanists
This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: that not all words are equally valuable. Here, I’ll...
A Dataset of Term Stats in Literature
Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset. Crunched for 235,000 Language and Literature...
Understanding Classified Languages in the HathiTrust
The HTRC Extracted Features (EF) dataset provides two forms of language information: the volume-level bibliographic metadata (what the library record says), as well as machine-classified tags for each...
Pico Safari: Active Gaming in Integrated Environments
With the recent release of Pokemon Go, I’m posting my presentation notes for a similar game called Pico Safari, a collaboration with Lucio Gutierrez, Garry Wong, and Calen Henry in late 2009, advised...
Beyond tokens: what character counts say about a page
When talking about quantitative features in text analysis, the term token count is king, but other features can help infer the content and context of a page. I demonstrate visually how the characters at...