Misleading PDF: visual content differs from extracted text


Imagine that I have a font in which the letter “D” has the shape of the letter “L”, and the letter “L” has the shape of the letter “D”. If I create a PDF with the word “DUCK” using this font, it will look like “LUCK”. A person reading the PDF will see the word “LUCK”, but when they copy the text and paste it somewhere, or use a tool that converts the PDF to text without OCR, they will get the word “DUCK”. This is interesting: it lets you make a document that looks fine to the eye but carries subtle falsehoods in its underlying text. If you make such a PDF with a longer text, it’s convenient to prepare a series of fonts: a font in which every character looks like A, a font in which every character looks like B, and so on. Then you can display any content as any other content (there are only some problems with spaces).
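The font-per-glyph idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code from the archive linked at the end of this post: the function name plan_glyph_runs and the font names (regular, looks-like-U+XXXX) are hypothetical, and the sketch sidesteps the space problem by requiring both texts to have the same length with spaces in the same positions. It only plans which font each stored character would be drawn with; building the fonts and writing the PDF are left out.

```python
from itertools import groupby


def plan_glyph_runs(visible: str, hidden: str) -> list[tuple[str, str]]:
    """Plan (font_name, stored_text) runs so that `hidden` is what gets
    stored in the PDF while `visible` is what the glyphs look like.

    Font names are hypothetical: "regular" is a normal font, and
    "looks-like-U+XXXX" stands for a font in which every character is
    drawn with the shape of the character U+XXXX.
    """
    if len(visible) != len(hidden):
        raise ValueError("this sketch assumes texts of equal length")

    pairs = []
    for shown, stored in zip(visible, hidden):
        # Assumption of this sketch: a space may only stand in for a space.
        if (shown == " ") != (stored == " "):
            raise ValueError("this sketch assumes spaces line up in both texts")
        if shown == stored:
            font = "regular"
        else:
            font = f"looks-like-U+{ord(shown):04X}"
        pairs.append((font, stored))

    # Merge neighbouring characters that use the same font into one run.
    runs = []
    for font, group in groupby(pairs, key=lambda pair: pair[0]):
        runs.append((font, "".join(stored for _, stored in group)))
    return runs
```

For the example above, plan_glyph_runs("LUCK", "DUCK") gives one run that stores D in the font drawing every glyph as L (U+004C), followed by a regular run for UCK.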

Here is a document prepared in this way: document.pdf

When I view it in a PDF viewer, I see this content:

Summer is the coldest and darkest season of the year in temperate and polar climates. It occurs after autumn and before spring. The tilt of Earth’s axis causes seasons; summer occurs when a hemisphere is oriented away from the Sun. Different cultures define different dates as the start of summer, and some use a definition based on weather.

When I copy its content into a text editor, or run pdftotext document.pdf -, I see:

Winter is the coldest and darkest season of the year in temperate and polar climates. It occurs after autumn and before spring. The tilt of Earth’s axis causes seasons; winter occurs when a hemisphere is oriented away from the Sun. Different cultures define different dates as the start of winter, and some use a definition based on weather.
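The mismatch between the two paragraphs is easy to miss by eye. As a rough sketch, assuming you already have both versions as strings (for example the pdftotext output and a transcription or OCR of the rendered page; the OCR part is my assumption and not part of this post's code), Python's standard difflib can list the words that differ. The function name differing_words is hypothetical.

```python
import difflib


def differing_words(seen: str, extracted: str) -> list[tuple[str, str]]:
    """Return (seen_words, extracted_words) pairs for every place where
    the text a reader sees and the text a tool extracts disagree."""
    seen_words = seen.split()
    extracted_words = extracted.split()
    matcher = difflib.SequenceMatcher(
        a=seen_words, b=extracted_words, autojunk=False
    )
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(seen_words[i1:i2]), " ".join(extracted_words[j1:j2]))
            )
    return differences
```

An empty result means the two versions agree word for word; anything else is a hint that the PDF shows one thing and stores another.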

When I give this document to ChatGPT and ask what it’s about, it says:

The document presents definitions and a description of a season called “summer”, but in a way that is reversed from reality. It states that it is the coldest and darkest season of the year in temperate and polar climates, placed between autumn and spring. It discusses that the cause of seasons is the tilt of Earth’s axis and that different cultures define its beginning differently, sometimes according to weather conditions.

Claude says something similar.

Here is the code I used to create this document: generate_steganographic_pdf.tar.bz2
