Figure 1: Example of an image pair from Reyes Ayala’s dataset: a screenshot of a live web page (left) and a screenshot of an archived web page (right).
Reyes Ayala, B. (2025). Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons. In M. Cornia et al. (Eds.), Proceedings of the 21st Conference on Information and Research science Connecting to Digital and Library science, 3937. Udine, Italy: CEUR Workshop Proceedings. https://ceur-ws.org/Vol-3937/
Measuring Visual Correspondence With Web Archiving Screenshot Compare Tool
Reyes Ayala created the Web Archiving Screenshot Compare tool, which helps automate quality assurance by comparing a screenshot of the live web page with a screenshot of the archived web page and determining their visual correspondence. Generating the screenshots involves several steps. First, the tool reads the settings file that contains the seed list. For each seed, it checks whether the web page exists and, if so, takes a screenshot. Next, the tool creates a CSV file listing the Archive-It URI-Ms associated with the archived versions of the seed. Then, the tool takes a screenshot of each archived web page. Finally, the URI-Ms and their screenshot file names are written to a CSV file.
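The final step, recording each URI-M next to its screenshot file name, can be sketched in a few lines of Python. This is only an illustration and not the tool's actual code: the function names, the hash-based file naming scheme, the CSV column names, and the injected `capture` callable (a stand-in for whatever browser automation takes the screenshot) are all assumptions.

```python
import csv
import hashlib
import io
from typing import Callable, Iterable, TextIO


def screenshot_name(uri: str) -> str:
    """Hypothetical naming scheme: a short hash of the URI plus '.png'."""
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()[:12] + ".png"


def write_screenshot_index(
    urims: Iterable[str],
    capture: Callable[[str, str], bool],
    out: TextIO,
) -> int:
    """Capture each URI-M and record (URI-M, file name) rows as CSV.

    `capture(urim, file_name)` is an assumed callback that takes the
    screenshot and returns True on success. Returns the number of rows
    written, not counting the header.
    """
    writer = csv.writer(out)
    writer.writerow(["urim", "screenshot_file"])  # assumed column names
    written = 0
    for urim in urims:
        file_name = screenshot_name(urim)
        if capture(urim, file_name):
            writer.writerow([urim, file_name])
            written += 1
    return written


if __name__ == "__main__":
    # Placeholder URI-Ms and a fake capture step that always succeeds.
    buffer = io.StringIO()
    count = write_screenshot_index(
        ["https://example.org/memento/1", "https://example.org/memento/2"],
        lambda urim, file_name: True,
        buffer,
    )
    print(count)
    print(buffer.getvalue())
```

Passing the capture step in as a callable keeps the bookkeeping separate from the browser automation, so the CSV logic can be exercised without launching a browser.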
After the screenshots are taken, the Web Archiving Screenshot Compare tool can compute a score using an image similarity metric. Before computing the score, the tool checks that the live web page screenshot is not blank. If the two screenshots in the (live, archived) pair differ in size, the tool then crops a screenshot from the pair. After the score is computed, it is written to a CSV file.
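A minimal sketch of these pre-checks follows, with images represented as nested lists of grayscale values so that it needs no third-party libraries. Two details are assumptions rather than the tool's documented behavior: "blank" is taken to mean that every pixel has the same value, and cropping trims both images to their shared top-left region. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Rows of 8-bit grayscale pixel values; assumed non-empty and rectangular.
Image = List[List[int]]


def is_blank(image: Image) -> bool:
    """Assumed definition of blank: every pixel has the same value."""
    distinct_values = {pixel for row in image for pixel in row}
    return len(distinct_values) <= 1


def crop_to_common_size(a: Image, b: Image) -> Tuple[Image, Image]:
    """Trim both images to the overlapping top-left region."""
    height = min(len(a), len(b))
    width = min(len(a[0]), len(b[0]))

    def crop(image: Image) -> Image:
        return [row[:width] for row in image[:height]]

    return crop(a), crop(b)


def prepare_pair(live: Image, archived: Image) -> Optional[Tuple[Image, Image]]:
    """Return a same-sized (live, archived) pair, or None if live is blank."""
    if is_blank(live):
        return None  # nothing meaningful to compare against
    return crop_to_common_size(live, archived)
```

With a 3-pixel-wide, 2-pixel-tall live image and a 2-pixel-wide, 3-pixel-tall archived image, `prepare_pair` returns two 2×2 images, which can then be passed to whichever similarity metric is selected.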
Determining the Effectiveness of the Image Similarity Metrics
The image similarity metrics supported by the Web Archiving Screenshot Compare tool are Structural Similarity Index (SSIM), Mean Squared Error (MSE), Normalized Root Mean Square Error (NRMSE), Perceptual Hash (P-Hash), Peak Signal-to-Noise Ratio (PSNR), and a percentage similarity metric that Reyes Ayala created. Three of these metrics (MSE, P-Hash, and PSNR) were discarded from her evaluation because they did not have an upper bound. The percentage similarity metric was also discarded because it had a strong negative correlation with NRMSE.
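For readers unfamiliar with the error-based metrics, the sketch below gives textbook definitions of MSE, NRMSE, and PSNR over flattened lists of pixel values. It is not taken from the tool: the function names are hypothetical, and the NRMSE normalization shown here (dividing by the root mean square of the reference image) is only one of several conventions in use, so the tool may normalize differently. The PSNR function shows one way a score can be unbounded: it is infinite when the two images are identical.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], test: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)


def nrmse(reference: Sequence[float], test: Sequence[float]) -> float:
    """RMSE divided by the root mean square of the reference (an assumption)."""
    denominator = math.sqrt(sum(r * r for r in reference) / len(reference))
    if denominator == 0:
        raise ValueError("reference image is all zeros; NRMSE is undefined")
    return math.sqrt(mse(reference, test)) / denominator


def psnr(
    reference: Sequence[float],
    test: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in decibels; infinite for identical inputs."""
    error = mse(reference, test)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)
```

Note the opposite directions of these scores: lower MSE and NRMSE mean the screenshots are more alike, whereas higher PSNR does.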