UK Selective Web Archive Classification Dataset. 1996 – 2010. TSV.

The dataset comprises a manually curated selective archive produced by the UK Web Archive (UKWA), which includes the classification of sites into a two-tiered subject hierarchy. In partnership with the Internet Archive and JISC, UKWA has obtained access to the subset of the Internet Archive’s web collection that relates to the UK. The JISC UK Web Domain Dataset (1996 – 2013) contains all of the resources from the Internet Archive that were hosted on domains ending in ‘.uk’, or that are required in order to render those UK pages.

UKWA has made this manually generated classification information available as an open dataset in Tab Separated Values (TSV) format. UKWA is particularly interested in whether high-level metadata like this can be used to train an automatic classification system, so that this manually generated dataset may be used to partially automate the categorisation of UKWA’s larger archives.

UKWA expects that an appropriate classifier might require more information about each site in order to produce reliable results, and a future goal is to augment this dataset with further information. Options include, for each site, making the titles of every page on that site available and extracting a set of keywords that summarise the site via the full-text index.
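As a rough illustration of how a TSV file with a two-tiered classification might be inspected before any classifier is trained, the sketch below reads tab-separated rows and counts sites per category. The column names ("Primary Category", "Secondary Category", "Title", "URL") and the sample rows are assumptions made for this example only; the actual column layout of the dataset is not described here and should be checked against the file itself.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: column names and rows are invented for illustration
# and are NOT taken from the real dataset.
SAMPLE_TSV = (
    "Primary Category\tSecondary Category\tTitle\tURL\n"
    "Arts\tArchitecture\tExample Site One\thttp://one.example.org.uk/\n"
    "Arts\tArchitecture\tExample Site Two\thttp://two.example.org.uk/\n"
    "Arts\tLiterature\tExample Site Three\thttp://three.example.org.uk/\n"
    "Science\tPhysics\tExample Site Four\thttp://four.example.org.uk/\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def count_by_category(rows,
                      primary_col="Primary Category",
                      secondary_col="Secondary Category"):
    """Count sites per (top-level, second-level) category pair."""
    counts = Counter()
    for row in rows:
        counts[(row[primary_col], row[secondary_col])] += 1
    return counts


def count_top_level(pair_counts):
    """Collapse (top-level, second-level) counts down to the top tier only."""
    totals = Counter()
    for (primary, _secondary), n in pair_counts.items():
        totals[primary] += n
    return totals


rows = load_rows(SAMPLE_TSV)
counts = count_by_category(rows)
top_level = count_top_level(counts)

for (primary, secondary), n in sorted(counts.items()):
    print(f"{primary} / {secondary}: {n}")
```

Per-category counts like these give a first indication of class balance, which matters if the labels are later used as training targets. To run the sketch on the real file, the sample text would be replaced with the file's contents and the column names adjusted to match its header.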

Additional information



BL Dataset Provider

User Access Level

BL Labs Assistance


Jackson, Andrew N.


UK Web Archive


Contact Person

British Library Labs


Repository Cloud

Official URL

Is It Being Updated

Any Issues With Access




T&C Needed

Rights Assessment

