At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language inference, the metadata is generally what arrives from institutional providers via HathiTrust. HathiTrust provides some baseline guidelines for partners, but beyond those, what you can expect depends on what each institutional partner supplies.
For a sense of what that is, below is a list of the most common MARC fields, with the percentage of records that contain each one. Crunching this didn’t involve any special access through the Research Center: you could easily access the same records via HathiTrust’s Bibliographic API (and hey, some code!).
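If you want to tally something similar yourself, here is a minimal sketch of the counting step. It is not the code linked above: it assumes you have already retrieved each record as a MARC-XML string (for example from the Bibliographic API's full record responses, which are assumed here to include MARC-XML), and the function names `tags_in_record` and `field_coverage` are made up for illustration. Only data fields are counted, which seems to match the table below.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tags_in_record(marcxml):
    """Return the set of MARC data-field tags present in one MARC-XML record."""
    root = ET.fromstring(marcxml)
    tags = set()
    for el in root.iter():
        # Strip any XML namespace, e.g. "{http://www.loc.gov/MARC21/slim}datafield".
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "datafield" and el.get("tag"):
            tags.add(el.get("tag"))
    return tags


def field_coverage(records):
    """Percentage of records containing each tag, most common first.

    A repeated field (say, three 650s in one record) counts once per record,
    because coverage is about presence, not frequency.
    """
    counts = Counter()
    total = 0
    for marcxml in records:
        total += 1
        counts.update(tags_in_record(marcxml))
    if total == 0:
        return {}
    return {tag: round(100 * n / total, 3) for tag, n in counts.most_common()}
```

Passing a list of MARC-XML strings to `field_coverage` returns a dictionary such as `{"245": 100.0, "100": 50.0}`, which you can print as a table like the one below.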
The good news is that at the scale of the HathiTrust’s collection, even a small random fraction of the full collection is sufficient to see many quantitative patterns. We see this often at the Research Center, where we work largely at aggregate levels over thousands or millions of volumes: term frequencies, topic models, and other types of distributions converge long before you reach the full collection. The bad news is that you can’t assume a field is missing at random: the records that lack a field may differ systematically from the ones that have it. To find out, you’ll have to look more closely at the field that you’re interested in.
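To put a rough number on "small random fraction": if you estimate a field's coverage from a simple random sample, the usual normal-approximation margin of error for a proportion gives a sense of how quickly the estimate settles. This is a back-of-the-envelope sketch, not a description of how the table was produced. The sample sizes are hypothetical, the finite-population correction is ignored, and it says nothing about the non-random missingness caveat above.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p estimated from n samples.

    Uses the normal approximation; z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Hypothetical sample sizes, with coverage near the 050 figure in the table.
    for sample_size in (1_000, 10_000, 100_000):
        moe = margin_of_error(0.47, sample_size)
        print(f"n={sample_size:>7}: +/- {100 * moe:.2f} percentage points")
```

With ten thousand sampled records, the margin for a field with about 47% coverage is already around one percentage point.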
This data is presented for curiosity with little commentary, but I will offer one pro-tip: If you’re looking to get a thematic classification of a volume, the roughly 47% coverage of Library of Congress call numbers (field 050) is a good place to start.
Field | Records with field (%) | Description |
---|---|---|
035 | 100 | SYSTEM CONTROL NUMBER (R) |
245 | 100 | TITLE STATEMENT (NR) |
538 | 100 | SYSTEM DETAILS NOTE (R) |
974 | 100 | NA |
260 | 99.681 | PUBLICATION, DISTRIBUTION, ETC. (IMPRINT) (R) |
300 | 98.971 | PHYSICAL DESCRIPTION (R) |
040 | 95.834 | CATALOGING SOURCE (NR) |
100 | 73.913 | MAIN ENTRY–PERSONAL NAME (NR) |
650 | 64.913 | SUBJECT ADDED ENTRY–TOPICAL TERM (R) |
010 | 51.468 | LIBRARY OF CONGRESS CONTROL NUMBER (NR) |
050 | 47.174 | LIBRARY OF CONGRESS CALL NUMBER (R) |
500 | 38.911 | GENERAL NOTE (R) |
504 | 36.012 | BIBLIOGRAPHY, ETC. NOTE (R) |
020 | 34.879 | INTERNATIONAL STANDARD BOOK NUMBER (R) |
490 | 28.972 | SERIES STATEMENT (R) |
700 | 27.899 | ADDED ENTRY–PERSONAL NAME (R) |
043 | 27.784 | GEOGRAPHIC AREA CODE (NR) |
090 | 27.604 | LOCAL CALL NUMBER (BK AM CF MP MU VM SE) [OBSOLETE]; SHELF LOCATION (AM) [OBSOLETE] |
082 | 20.216 | DEWEY DECIMAL CLASSIFICATION NUMBER (R) |
651 | 19.915 | SUBJECT ADDED ENTRY–GEOGRAPHIC NAME (R) |
250 | 18.072 | EDITION STATEMENT (R) |
710 | 14.503 | ADDED ENTRY–CORPORATE NAME (R) |
600 | 14.061 | SUBJECT ADDED ENTRY–PERSONAL NAME (R) |
049 | 8.733 | NA |
880 | 8.355 | ALTERNATE GRAPHIC REPRESENTATION (R) |
042 | 8.061 | AUTHENTICATION CODE (NR) |
041 | 7.361 | LANGUAGE CODE (R) |
110 | 7.092 | MAIN ENTRY–CORPORATE NAME (NR) |
246 | 6.558 | VARYING FORM OF TITLE (R) |
610 | 5.859 | SUBJECT ADDED ENTRY–CORPORATE NAME (R) |
740 | 5.814 | ADDED ENTRY–UNCONTROLLED RELATED/ANALYTICAL TITLE (R) |
505 | 5.075 | FORMATTED CONTENTS NOTE (R) |
015 | 4.986 | NATIONAL BIBLIOGRAPHY NUMBER (R) |
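As a sketch of the call-number tip above: the leading letter of a Library of Congress call number identifies its main class, so a coarse thematic label can be read straight from the start of the 050 value. The helper name `lc_main_class` is hypothetical, and the code assumes you have already pulled the call number string out of the record.

```python
import re

# Library of Congress Classification main classes, keyed by first letter.
LC_MAIN_CLASSES = {
    "A": "General Works",
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "D": "World History",
    "E": "History of the Americas",
    "F": "History of the Americas",
    "G": "Geography, Anthropology, Recreation",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "L": "Education",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
    "T": "Technology",
    "U": "Military Science",
    "V": "Naval Science",
    "Z": "Bibliography, Library Science",
}


def lc_main_class(call_number):
    """Map a call number such as 'PS3545.I345' to its LC main class, or None."""
    if not call_number:
        return None
    # A call number starts with one to three letters, then digits.
    match = re.match(r"\s*([A-Za-z]{1,3})\s*\d", call_number)
    if match is None:
        return None
    return LC_MAIN_CLASSES.get(match.group(1)[0].upper())
```

For example, `lc_main_class("QA76.73.P98")` returns `"Science"`. For finer themes, the full letter prefix (such as `QA` for mathematics) can be kept instead of only the first letter.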