Channel: Peter Organisciak

MARC Fields in the HathiTrust


At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language inference, the metadata is generally what arrives from institutional providers via HathiTrust. HathiTrust provides some baseline guidelines for partners, but beyond those, what you can expect depends on what each institutional partner provides.

For a sense of what that is, below is a list of the most common MARC fields. Crunching this didn’t involve any special access through the Research Center: you could easily access the same records via HathiTrust’s Bibliographic API (and hey, some code!).
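As a rough sketch of how such a tally could be produced (not necessarily the code used for this post), the snippet below pulls MARC-XML for a volume and computes the share of records containing each field. The endpoint pattern and the `marc-xml` key reflect my understanding of the Bibliographic API's "full" response and should be checked against its documentation; the function names are hypothetical. Only data fields are counted, and a repeated field counts once per record.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed URL pattern for the "full" Bibliographic API response.
API_URL = "https://catalog.hathitrust.org/api/volumes/full/htid/{htid}.json"


def fetch_marc_xml(htid):
    """Return the MARC-XML strings for one volume id (makes a network call).

    Assumes the JSON has a "records" object whose values carry "marc-xml".
    """
    with urllib.request.urlopen(API_URL.format(htid=htid)) as resp:
        data = json.load(resp)
    return [rec["marc-xml"] for rec in data.get("records", {}).values()]


def field_tags(marc_xml):
    """Return the set of datafield tags (e.g. {"245", "650"}) in one record."""
    root = ET.fromstring(marc_xml)
    tags = set()
    for el in root.iter():
        # Compare local names so namespaced and bare MARC-XML both work.
        if el.tag.rsplit("}", 1)[-1] == "datafield":
            tags.add(el.get("tag"))
    return tags


def field_coverage(marc_xml_records):
    """Map each tag to the percentage of records that contain it."""
    counts = Counter()
    total = 0
    for marc_xml in marc_xml_records:
        counts.update(field_tags(marc_xml))
        total += 1
    if total == 0:
        return {}
    return {tag: 100 * n / total for tag, n in counts.items()}
```

Feeding `field_coverage` the MARC-XML for a random sample of volume ids and sorting by percentage would give a table shaped like the one in this post.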

The good news is that at the scale of the HathiTrust’s collection, even a small random fraction of the full collection is sufficient to see many quantitative patterns. We see this often at the Research Center, where we work largely at aggregate levels over thousands or millions of volumes: term frequencies, topic models, and other types of distributions converge long before you reach the full collection. The bad news is that you can’t assume a field is missing at random: the records that lack it may differ systematically from the records that include it. To check, you’ll have to look more closely at the field that you’re interested in.
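To put a rough number on "small random fraction", here is a minimal sketch of the usual normal-approximation margin of error for a proportion estimated from a simple random sample. The sample size in the usage note is hypothetical, since the size of the sample behind this post isn't stated, and the formula says nothing about the non-random missingness described above.

```python
import math


def proportion_margin(p, n, z=1.96):
    """Approximate half-width of a confidence interval for a proportion.

    p: observed proportion (0..1), n: sample size,
    z: normal quantile (1.96 corresponds to roughly 95% confidence).
    Assumes a simple random sample from a much larger collection.
    """
    return z * math.sqrt(p * (1 - p) / n)
```

For instance, with a hypothetical sample of 10,000 records and an observed coverage near 50%, `proportion_margin(0.5, 10000)` comes out to about 0.0098, or roughly one percentage point either way.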

This data is presented for curiosity with little commentary, but I will offer one pro-tip: If you’re looking to get a thematic classification of a volume, the ~50% coverage of Library of Congress call numbers is a good place to start.
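As one way to act on that tip, the sketch below maps a Library of Congress call number (as found in field 050) to its top-level class using the first letter. The labels follow the Library of Congress Classification outline; the function name is hypothetical, and real call numbers may need more cleanup than this.

```python
import re

# Top-level classes from the Library of Congress Classification outline.
LC_CLASSES = {
    "A": "General Works",
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "D": "World History",
    "E": "History of the Americas",
    "F": "History of the Americas",
    "G": "Geography, Anthropology, Recreation",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "L": "Education",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
    "T": "Technology",
    "U": "Military Science",
    "V": "Naval Science",
    "Z": "Bibliography, Library Science",
}


def lc_main_class(call_number):
    """Return the top-level LC class label, or None if it can't be determined."""
    match = re.match(r"\s*([A-Za-z])", call_number or "")
    if not match:
        return None
    # Letters not used as LC classes (I, O, W, X, Y) fall through to None.
    return LC_CLASSES.get(match.group(1).upper())
```

The second and third letters narrow the subject further (for example, QA within Science), so the same lookup idea extends to subclasses if a finer classification is needed.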

Field  % of records  Description
035    100           SYSTEM CONTROL NUMBER (R)
245    100           TITLE STATEMENT (NR)
538    100           SYSTEM DETAILS NOTE (R)
974    100           NA
260    99.681        PUBLICATION, DISTRIBUTION, ETC. (IMPRINT) (R)
300    98.971        PHYSICAL DESCRIPTION (R)
040    95.834        CATALOGING SOURCE (NR)
100    73.913        MAIN ENTRY–PERSONAL NAME (NR)
650    64.913        SUBJECT ADDED ENTRY–TOPICAL TERM (R)
010    51.468        LIBRARY OF CONGRESS CONTROL NUMBER (NR)
050    47.174        LIBRARY OF CONGRESS CALL NUMBER (R)
500    38.911        GENERAL NOTE (R)
504    36.012        BIBLIOGRAPHY, ETC. NOTE (R)
020    34.879        INTERNATIONAL STANDARD BOOK NUMBER (R)
490    28.972        SERIES STATEMENT (R)
700    27.899        ADDED ENTRY–PERSONAL NAME (R)
043    27.784        GEOGRAPHIC AREA CODE (NR)
090    27.604        LOCAL CALL NUMBER (BK AM CF MP MU VM SE) [OBSOLETE]
090    27.604        SHELF LOCATION (AM) [OBSOLETE]
082    20.216        DEWEY DECIMAL CLASSIFICATION NUMBER (R)
651    19.915        SUBJECT ADDED ENTRY–GEOGRAPHIC NAME (R)
250    18.072        EDITION STATEMENT (R)
710    14.503        ADDED ENTRY–CORPORATE NAME (R)
600    14.061        SUBJECT ADDED ENTRY–PERSONAL NAME (R)
049    8.733         NA
880    8.355         ALTERNATE GRAPHIC REPRESENTATION (R)
042    8.061         AUTHENTICATION CODE (NR)
041    7.361         LANGUAGE CODE (R)
110    7.092         MAIN ENTRY–CORPORATE NAME (NR)
246    6.558         VARYING FORM OF TITLE (R)
610    5.859         SUBJECT ADDED ENTRY–CORPORATE NAME (R)
740    5.814         ADDED ENTRY–UNCONTROLLED RELATED/ANALYTICAL TITLE (R)
505    5.075         FORMATTED CONTENTS NOTE (R)
015    4.986         NATIONAL BIBLIOGRAPHY NUMBER (R)

 

