In the previous post, we asked AI to make recommendations to help clean up data loaded directly from a CSV file. The initial data load for the Name column failed because a VARCHAR(64) was estimated to be sufficient to hold the data, but it was not, so a TEXT field was used instead.
So now we want to save some disk space!
The first obvious candidate may seem to be the Name field; a TEXT field can hold up to a gigabyte of data, and none of the names comes anywhere near that. But switching to VARCHAR does not gain us any space, as both types use the same underlying storage mechanism in PostgreSQL.
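A quick way to confirm that is to compare the stored size of the same value as TEXT and as VARCHAR using PostgreSQL's pg_column_size() function. A minimal sketch (the sample string is just an illustration, not a value from the table):

-- TEXT and VARCHAR(n) share the same varlena storage, so the reported sizes match
SELECT pg_column_size('Smith, Mr. John'::text)        AS as_text,
       pg_column_size('Smith, Mr. John'::varchar(64)) AS as_varchar;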
PassengerId
There are only 891 records, so we don’t have to worry about growth. This is a ‘fixed’ list, since no one else is going to be added to it, so we can treat the columns as having a fixed width. We can ask for the current maximum value of each column. In the ‘real’ world, when creating a table, you want your primary key column sized to allow for growth, hopefully enormous growth. In cases like this, however, you do not have to leave room for the future.
SELECT MAX("PassengerId")   AS "PassengerId",
       MAX("Survived")      AS "Survived",
       MAX("Pclass")        AS "Pclass",
       MAX(LENGTH("Name"))  AS "Name",
       MAX("Age")           AS "Age",
       MAX("SibSp")         AS "SibSp",
       MAX("Parch")         AS "Parch",
       MAX("Fare")          AS "Fare"
FROM titanic;
The maximum PassengerId is well within SMALLINT's range of 32,767, so we can save space by switching that column from a four-byte INTEGER to a two-byte SMALLINT. Several other columns can be shrunk the same way; the changes go into the revised copy of the table, titanic2, so it can be compared with the original.
ALTER TABLE public.titanic2 ALTER COLUMN "PassengerId" TYPE smallint USING "PassengerId"::smallint;
ALTER TABLE public.titanic2 ALTER COLUMN "Age" TYPE int2 USING "Age"::int2;
ALTER TABLE public.titanic2 ALTER COLUMN "Survived" TYPE boolean USING "Survived"::boolean;
ALTER TABLE public.titanic2 ALTER COLUMN "SibSp" TYPE smallint USING "SibSp"::smallint;
ALTER TABLE public.titanic2 ALTER COLUMN "Parch" TYPE smallint USING "Parch"::smallint;
ALTER TABLE public.titanic2 ALTER COLUMN "Embarked" TYPE varchar(1) USING "Embarked"::varchar(1);
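To double-check that the narrower types took effect, one option is to query information_schema. A minimal sketch, assuming the revised table lives in the public schema as in the ALTER statements above:

-- List the column types of the revised table after the ALTERs
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'titanic2'
ORDER BY ordinal_position;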
Using pgstattuple
We can use pgstattuple to see the results of this data size diet.
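pgstattuple ships with PostgreSQL as a contrib extension, so it has to be enabled once per database before it can be called. A minimal sketch of running it against both tables, assuming they both sit in the public schema:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Row-level storage statistics for the original and the revised table
SELECT * FROM pgstattuple('public.titanic');
SELECT * FROM pgstattuple('public.titanic2');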
Comparing the pgstattuple output for the original ‘titanic’ table with the revised ‘titanic2’ table:
The length of the live tuples has shrunk from 90,111 to 83,319 bytes, and the free space and free percentage have climbed too. A smaller table loads into memory faster, scans more quickly, and requires less space in a backup.