UK Selective Web Archive Classification Dataset. 1996 – 2010. TSV.

The dataset comprises a manually curated selective archive produced by the UK Web Archive (UKWA), which includes the classification of sites into a two-tiered subject hierarchy. In partnership with the Internet Archive and JISC, UKWA obtained access to the subset of the Internet Archive’s web collection that relates to the UK. The JISC UK Web Domain Dataset (1996 – 2013) contains all of the resources from the Internet Archive that were hosted on domains ending in ‘.uk’, or that are required in order to render those UK pages. UKWA has made this manually generated classification information available as an open dataset in Tab Separated Values (TSV) format.

UKWA is particularly interested in whether high-level metadata like this can be used to train an automatic classification system, so that this manually generated dataset may be used to partially automate the categorisation of UKWA’s larger archives. UKWA expects that such a classifier might require more information about each site in order to produce reliable results, and a future goal is to augment this dataset with further information. Options include making the titles of every page on each site available, and extracting a set of keywords that summarise each site via the full-text index. For more information: http://data.webarchive.org.uk/opendata/ukwa.ds.1/classification/
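As a first look at a TSV classification file of this kind, the following Python sketch counts how many rows fall under each value of a chosen column. This record does not document the column layout, so the column order and the sample rows below are invented for illustration, and the helper name count_by_column is hypothetical.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column):
    """Tally the values found in one column of a tab-separated file."""
    # QUOTE_NONE treats quote characters as ordinary text, which suits plain TSV.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Skip blank or short rows rather than failing on them.
        if len(row) > column:
            counts[row[column].strip()] += 1
    return counts


# Invented sample rows; the real file's columns and categories may differ.
SAMPLE = (
    "Arts & Humanities\tLiterature\tExample Poetry Site\thttp://poetry.example/\n"
    "Arts & Humanities\tHistory\tExample History Site\thttp://history.example/\n"
    "Science & Technology\tComputing\tExample Computing Site\thttp://computing.example/\n"
)

# Assumes (hypothetically) that column 0 holds the top-level subject.
top_level = count_by_column(io.StringIO(SAMPLE), column=0)
for category, n in top_level.most_common():
    print(category, n)
```

To try this on the downloaded file, an open file handle can be passed in place of the in-memory sample, for example open("classification.tsv.text", newline="", encoding="utf-8"). The UTF-8 encoding is an assumption, and the column index should be checked against the actual file before the counts are relied upon.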

Additional information

UniqueID

4dcd0215-d8d1-4b95-862e-ed355860b737

BL Dataset Provider

User Access Level

BL Labs Assistance

Contributors

Jackson, Andrew N.

Institution

UK Web Archive

Language

Contact Person

British Library Labs

Location

Repository Cloud

Official URL

https://doi.org/10.5259/ukwa.ds.1/classification/1

Is It Being Updated

Any Issues With Access

No

Files

classification.tsv.text

T&C Needed

Rights Assessment
