2026-01-22: Paper Summary: "Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons"

Figure 1: Example of an image pair from Reyes Ayala’s dataset of a live web page (left screenshot) and an archived web page (right screenshot).

Reyes Ayala, B. (2025). Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons. In M. Cornia et al. (Eds.). Proceedings of the 21st Conference on Information and Research science Connecting to Digital and Library science, 3937. Udine, Italy: CEUR Workshop Proceedings https://ceur-ws.org/Vol-3937/

Measuring Visual Correspondence With Web Archiving Screenshot Compare Tool

Reyes Ayala created the Web Archiving Screenshot Compare tool, which helps automate quality assurance by comparing screenshots of a live web page and its archived version and measuring their visual correspondence. Screenshots are generated in several steps. First, the tool reads a settings file that contains the seed list. For each seed, it checks whether the web page exists and, if so, takes a screenshot. Next, the tool creates a CSV file listing the Archive-It URI-Ms for the archived versions of the seed. It then takes a screenshot of each archived web page. Finally, it writes the URI-Ms and their screenshot file names to a CSV file.
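
The screenshot workflow described above can be sketched roughly as follows. This is a simplified illustration and not the tool's actual code: the function name capture_screenshots, the callables page_exists, list_urims, and take_screenshot, the file naming scheme, and the CSV column names are all hypothetical, and the sketch writes a single index CSV where the tool writes more than one file. The browser automation is passed in as a callable so that the control flow can be read (and run) without a browser.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, Sequence


def capture_screenshots(
    seeds: Iterable[str],
    page_exists: Callable[[str], bool],
    list_urims: Callable[[str], Sequence[str]],
    take_screenshot: Callable[[str, Path], None],
    out_dir: Path,
) -> Path:
    """Screenshot each live seed and its archived versions, then write an index CSV.

    All callables are hypothetical stand-ins supplied by the caller:
    page_exists checks the live page, list_urims returns the URI-Ms for a
    seed, and take_screenshot saves a screenshot of a URI to a file path.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for i, seed in enumerate(seeds):
        if not page_exists(seed):
            continue  # no live page, so there is nothing to compare against
        live_file = out_dir / f"live_{i}.png"
        take_screenshot(seed, live_file)
        for j, urim in enumerate(list_urims(seed)):
            archived_file = out_dir / f"archived_{i}_{j}.png"
            take_screenshot(urim, archived_file)
            rows.append((seed, live_file.name, urim, archived_file.name))

    # Record which screenshot file belongs to which URI-M.
    index_path = out_dir / "screenshot_index.csv"
    with index_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["seed", "live_file", "urim", "archived_file"])
        writer.writerows(rows)
    return index_path
```

In practice, take_screenshot would wrap whatever headless browser is available, and list_urims would query the web archive for the mementos of a seed.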

After the screenshots are taken, the Web Archiving Screenshot Compare tool can compute a score with an image similarity metric. Before computing the score, the tool checks that the live web page screenshot is not blank and, if the live and archived screenshots in a pair differ in size, crops a screenshot so that the two match. The computed score is then written to a CSV file.
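
A minimal sketch of these two preprocessing checks is shown below. It makes several assumptions that the paper summary does not confirm: images are represented as nested lists of grayscale values so the example has no dependencies, "blank" is taken to mean that every pixel has the same value, and cropping keeps the top-left region shared by both images. The names is_blank, crop_to_common_size, and prepare_pair are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of grayscale pixel values (0-255)


def is_blank(image: Image) -> bool:
    """Assumed definition of blank: the image has at most one distinct pixel value."""
    distinct = {pixel for row in image for pixel in row}
    return len(distinct) <= 1


def crop_to_common_size(a: Image, b: Image) -> Tuple[Image, Image]:
    """Crop both images, from the top-left corner, to their shared height and width."""
    if not a or not b:
        return [], []
    height = min(len(a), len(b))
    width = min(len(a[0]), len(b[0]))

    def crop(image: Image) -> Image:
        return [row[:width] for row in image[:height]]

    return crop(a), crop(b)


def prepare_pair(live: Image, archived: Image) -> Optional[Tuple[Image, Image]]:
    """Return a size-matched (live, archived) pair, or None if the live image is blank."""
    if is_blank(live):
        return None  # a blank live screenshot gives nothing meaningful to compare
    return crop_to_common_size(live, archived)
```

With real screenshots the same logic would typically be applied to image arrays loaded with an imaging library rather than to nested lists.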

Determining the Effectiveness of the Image Similarity Metrics

The Web Archiving Screenshot Compare tool supports the following image similarity metrics: Structural Similarity Index (SSIM), Mean Squared Error (MSE), Normalized Root Mean Square Error (NRMSE), Perceptual Hash (P-Hash), Peak Signal to Noise Ratio (PSNR), and a percentage similarity metric that Reyes Ayala created. Three of these metrics (MSE, P-Hash, and PSNR) were excluded from her evaluation because they do not have an upper bound. The percentage similarity metric was also excluded because it had a strong negative correlation with NRMSE.
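
To make three of the error-based metrics concrete, here is a small sketch using their textbook definitions. It is not taken from the tool: the function names are hypothetical, images are nested lists of grayscale values, and NRMSE is normalized by the root mean square of the reference image, which is only one of several common normalizations. SSIM and P-Hash are omitted because they need considerably more code.

```python
import math
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values


def _flatten(image: Image) -> List[float]:
    return [float(pixel) for row in image for pixel in row]


def mse(reference: Image, test: Image) -> float:
    """Mean squared error: the average squared per-pixel difference."""
    x, y = _flatten(reference), _flatten(test)
    if not x or len(x) != len(y):
        raise ValueError("images must be non-empty and the same size")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def nrmse(reference: Image, test: Image) -> float:
    """Root of the MSE divided by the reference's root mean square (assumed normalization)."""
    error = mse(reference, test)  # also validates the image sizes
    x = _flatten(reference)
    denominator = math.sqrt(sum(a * a for a in x) / len(x))
    if denominator == 0:
        raise ValueError("reference image is all zeros")
    return math.sqrt(error) / denominator


def psnr(reference: Image, test: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels; infinite when the images are identical."""
    error = mse(reference, test)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


a = [[100, 100], [100, 100]]
b = [[110, 110], [110, 110]]
print(mse(a, b))    # 100.0
print(psnr(a, a))   # inf
```

The last line illustrates the upper-bound issue for PSNR: as two images become more alike, the score grows without limit, which makes it awkward to compare against a fixed reviewer scale.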

The evaluation dataset included 221 pairs of screenshots of live and archived web pages. The archived web pages were from four Archive-It collections (Idle No More, Fort McMurray Wildfire 2016, Western Canadian Arts, and Government of Canada). After calculating the similarity scores for her dataset, she sent the screenshots to Amazon Mechanical Turk (AMT) so that she could compare the computed scores with reviewer scores. Figure 1 shows an example of an image pair that was shown to AMT participants.

Using tests of statistical significance, Reyes Ayala found that SSIM and NRMSE were able to detect high and low visual correspondence. The tests she used were a one-way multivariate analysis of variance (MANOVA) and univariate analyses of variance (ANOVAs). The MANOVA results for the combined dependent variable were: 𝐹(2, 222) = 44.95, 𝑝 < .001; Wilks’ 𝜆 = 0.71; Pillai’s trace = 0.29; partial 𝜂² = 0.29. The univariate ANOVA results (with a Bonferroni-adjusted 𝛼 level of .025) were: SSIM: 𝐹(1, 223) = 10.53, 𝑝 = .001, partial 𝜂² = 0.05; NRMSE: 𝐹(1, 223) = 89.52, 𝑝 < .001, partial 𝜂² = 0.29.
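
As a quick sanity check on how these numbers relate, partial 𝜂² can be recovered from an 𝐹 statistic and its degrees of freedom with the standard identity partial 𝜂² = (𝐹 × df_effect) / (𝐹 × df_effect + df_error). The sketch below applies that identity to the reported values. The function names are our own, and the family-wise 𝛼 of .05 split across two ANOVAs is an assumption, since the summary above only states the adjusted level of .025.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def bonferroni_alpha(family_alpha: float, num_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return family_alpha / num_tests


print(round(partial_eta_squared(44.95, 2, 222), 2))  # MANOVA      -> 0.29
print(round(partial_eta_squared(10.53, 1, 223), 2))  # SSIM ANOVA  -> 0.05
print(round(partial_eta_squared(89.52, 1, 223), 2))  # NRMSE ANOVA -> 0.29
print(bonferroni_alpha(0.05, 2))                     # assumed family alpha of .05 -> 0.025
```

The recovered effect sizes match the reported ones, which suggests that NRMSE accounted for far more of the variance between the high and low correspondence groups than SSIM did.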

Reyes Ayala created the Web Archiving Screenshot Compare tool, which assesses the quality of an archived web page by comparing screenshots of the live web page and the archived web page. After creating a dataset of 221 pairs of screenshots and collecting human review scores, she analyzed the results using tests of statistical significance. She found that the Structural Similarity Index (SSIM) and Normalized Root Mean Square Error (NRMSE) were able to distinguish between high-quality and low-quality archived web pages.

For our web archiving livestreams, we currently measure the performance of the web archive crawler during the livestream, and we plan to measure replay performance in future livestreams. Writing this paper summary introduced us to another approach that could be used to measure visual correspondence during a web archiving livestream.
