From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Custom compression methods
Date: 2017-09-07 16:42:36
Message-ID: [email protected]
Lists: pgsql-hackers
Hello hackers!
I've attached a patch that implements custom compression
methods. This patch is based on Nikita Glukhov's code (which he hasn't
published on the mailing lists) for jsonb compression. This is an early
but working version of the patch, and there are still a few fixes and
features that should be implemented (like pg_dump support and support
of compression options for types), and it requires more testing. But
I'd like to get some feedback at the current stage first.
There's been a proposal [1] by Alexander Korotkov and some discussion
about custom compression methods before. This is an implementation of
per-datum compression. The syntax is similar to the one in the proposal
but not the same.
Syntax:
CREATE COMPRESSION METHOD <cmname> HANDLER <compression_handler>;
DROP COMPRESSION METHOD <cmname>;
The compression handler is a function that returns a structure
containing compression routines:
- configure - called when the compression method is applied to an
attribute
- drop - called when the compression method is removed from an attribute
- compress - compress function
- decompress - decompress function
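To make the handler contract concrete, here is a minimal C sketch of
what an extension's handler could look like. This is an illustration
only: the struct name CompressionMethodRoutine appears later in this
thread, but the field types, signatures, and the my_* names are
assumptions, not the patch's actual definitions.

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_attribute.h"
#include "nodes/pg_list.h"

PG_MODULE_MAGIC;

/* assumed layout of the routine table */
typedef struct CompressionMethodRoutine
{
    void    (*configure) (Form_pg_attribute att, List *options);
    void    (*drop) (Form_pg_attribute att);
    struct varlena *(*compress) (const struct varlena *value);
    struct varlena *(*decompress) (const struct varlena *value);
} CompressionMethodRoutine;

static struct varlena *
my_compress(const struct varlena *value)
{
    /* a real method returns a compressed copy; NULL as "leave it
     * uncompressed" is an assumption, not the patch's contract */
    return NULL;
}

static struct varlena *
my_decompress(const struct varlena *value)
{
    elog(ERROR, "decompression not implemented in this sketch");
    return NULL;                /* unreachable, silences the compiler */
}

PG_FUNCTION_INFO_V1(my_compression_handler);

Datum
my_compression_handler(PG_FUNCTION_ARGS)
{
    CompressionMethodRoutine *routine =
        palloc0(sizeof(CompressionMethodRoutine));

    routine->configure = NULL;      /* no options to validate */
    routine->drop = NULL;           /* optional cleanup hook */
    routine->compress = my_compress;
    routine->decompress = my_decompress;

    PG_RETURN_POINTER(routine);
}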
Users can create compressed columns with the commands below:
CREATE TABLE t(a tsvector COMPRESSED <cmname> WITH <options>);
ALTER TABLE t ALTER COLUMN a SET COMPRESSED <cmname> WITH <options>;
ALTER TABLE t ALTER COLUMN a SET NOT COMPRESSED;
There is also syntax for binding compression methods to types:
ALTER TYPE <type> SET COMPRESSED <cmname>;
ALTER TYPE <type> SET NOT COMPRESSED;
There are two new tables in the catalog, pg_compression and
pg_compression_opt. pg_compression is used as storage for compression
methods, and pg_compression_opt is used to store the specific
compression options for a particular column.
When a user binds a compression method to a column, a new record is
created in pg_compression_opt, and all new attribute values will
contain the Oid of the compression options, while old values remain
unchanged. And when we alter the compression method for the attribute,
it won't change the previous record in pg_compression_opt. Instead it
will create a new one, and new values will be stored with the new Oid.
That way there is no need to recompress the old tuples. Also, tuples
containing compressed datums can be copied to other tables, so records
in pg_compression_opt shouldn't be removed. In the current patch they
can be removed with DROP COMPRESSION METHOD CASCADE, but after that
decompression won't be possible for compressed tuples. Maybe CASCADE
should keep the compression options.
I haven't changed the base logic of working with compressed datums.
That means custom compressed datums behave exactly the same as the
current LZ compressed datums, and the logic differs only in
toast_compress_datum and toast_decompress_datum.
This patch doesn't break backward compatibility and should work
seamlessly with older versions of the database. I used one of the two
free bits in `va_rawsize` from `varattrib_4b->va_compressed` as a flag
for custom compressed datums. I also renamed it to `va_info` since it
now contains more than just the raw size.
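As an illustration of the flag-bit idea (the macro names and the exact
bit chosen are guesses for illustration, not the patch's definitions):

/* The raw size needs at most 30 bits, so the two high bits of the
 * 32-bit va_info word are free; one of them can flag a custom
 * compressed datum. */
#define VARINFO_CUSTOM_COMPRESSED   0x80000000U
#define VARINFO_RAWSIZE_MASK        0x3FFFFFFFU

#define VARATT_IS_CUSTOM_COMPRESSED(ptr) \
    ((((varattrib_4b *) (ptr))->va_compressed.va_info & \
      VARINFO_CUSTOM_COMPRESSED) != 0)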
The patch also includes a custom compression method for tsvector, which
is used in tests.
[1]
https://www.postgresql.org/message-id/CAPpHfdsdTA5uZeq6MNXL5ZRuNx%2BSig4ykWzWEAfkC6ZKMDy6%3DQ%40mail.gmail.com
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment: custom_compression_methods_v1.patch (text/x-patch, 288.0 KB)
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Custom compression methods
Date: 2017-09-12 14:55:05
Message-ID: [email protected]
Lists: pgsql-hackers
On Thu, 7 Sep 2017 19:42:36 +0300
Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> [... original message from 2017-09-07 quoted in full; trimmed ...]
Attached is a rebased version of the patch. Added support for pg_dump;
the code was simplified, and a separate cache for compression options
was added.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment: custom_compression_methods_v2.patch (text/x-patch, 314.4 KB)
From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Custom compression methods
Date: 2017-11-01 21:05:58
Message-ID: [email protected]
Lists: pgsql-hackers
On 9/12/17 10:55, Ildus Kurbangaliev wrote:
>> The patch also includes custom compression method for tsvector which
>> is used in tests.
>>
>> [1]
>> https://www.postgresql.org/message-id/CAPpHfdsdTA5uZeq6MNXL5ZRuNx%2BSig4ykWzWEAfkC6ZKMDy6%3DQ%40mail.gmail.com
> Attached rebased version of the patch. Added support of pg_dump, the
> code was simplified, and a separate cache for compression options was
> added.
I would like to see some more examples of how this would be used, so we
can see how it should all fit together.
So far, it's not clear to me that we need a compression method as a
standalone top-level object. It would make sense, perhaps, to have a
compression function attached to a type, so a type can provide a
compression function that is suitable for its specific storage.
The proposal here is very general: You can use any of the eligible
compression methods for any attribute. That seems very complicated to
manage. Any attribute could be compressed using either a choice of
general compression methods or a type-specific compression method, or
perhaps another type-specific compression method. That's a lot. Is
this about packing certain types better, or trying out different
compression algorithms, or about changing the TOAST thresholds, and so on?
Ideally, we would like something that just works, with minimal
configuration and nudging. Let's see a list of problems to be solved
and then we can discuss what the right set of primitives might be to
address them.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Custom compression methods
Date: 2017-11-02 09:41:01
Message-ID: [email protected]
Lists: pgsql-hackers
On Wed, 1 Nov 2017 17:05:58 -0400
Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> On 9/12/17 10:55, Ildus Kurbangaliev wrote:
> >> The patch also includes custom compression method for tsvector
> >> which is used in tests.
> >>
> >> [1]
> >> https://www.postgresql.org/message-id/CAPpHfdsdTA5uZeq6MNXL5ZRuNx%2BSig4ykWzWEAfkC6ZKMDy6%3DQ%40mail.gmail.com
> > Attached rebased version of the patch. Added support of pg_dump, the
> > code was simplified, and a separate cache for compression options
> > was added.
>
> I would like to see some more examples of how this would be used, so
> we can see how it should all fit together.
>
> So far, it's not clear to me that we need a compression method as a
> standalone top-level object. It would make sense, perhaps, to have a
> compression function attached to a type, so a type can provide a
> compression function that is suitable for its specific storage.
In this patch compression methods are suitable for MAIN and EXTENDED
storage, like in the current implementation in postgres. Just instead
of only pglz you can specify any other compression method.
The idea is not to change compression for certain types, but to give
users and extension developers the opportunity to change how the data
in an attribute is compressed, because they know more about it than the
database itself.
>
> The proposal here is very general: You can use any of the eligible
> compression methods for any attribute. That seems very complicated to
> manage. Any attribute could be compressed using either a choice of
> general compression methods or a type-specific compression method, or
> perhaps another type-specific compression method. That's a lot. Is
> this about packing certain types better, or trying out different
> compression algorithms, or about changing the TOAST thresholds, and
> so on?
It is about extensibility of postgres. For example, if you need to
store a lot of time series data, you can create an extension that
stores arrays of timestamps in a more optimized way, using delta
encoding or something else. I'm not sure that such specialized things
should be in core.
In the case of an array of timestamps it could look like this:

CREATE EXTENSION timeseries; -- some extension that provides a
                             -- compression method

The extension installs a compression method:

CREATE OR REPLACE FUNCTION timestamps_compression_handler(INTERNAL)
RETURNS COMPRESSION_HANDLER AS 'MODULE_PATHNAME',
'timestamps_compression_handler' LANGUAGE C STRICT;
CREATE COMPRESSION METHOD cm1 HANDLER timestamps_compression_handler;

And the user can specify it in his table:

CREATE TABLE t1 (
    time_series_data timestamp[] COMPRESSED cm1
);
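As a toy illustration of the delta-encoding idea mentioned above (this
is not part of the patch; timestamps are 8-byte integers, so for
regularly sampled series the deltas are small and compress well):

#include <stdint.h>

/* Store the first timestamp, then differences between consecutive
 * ones; small deltas compress far better than raw 8-byte values. */
static void
delta_encode(const int64_t *ts, int n, int64_t *out)
{
    out[0] = ts[0];
    for (int i = 1; i < n; i++)
        out[i] = ts[i] - ts[i - 1];
}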
I think generalizing a method to a type is not a good idea. For one
attribute you may be happy with the built-in pglz; for another you may
need better compression, and so on.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Custom compression methods
Date: 2017-11-02 12:28:36
Message-ID: [email protected]
Lists: pgsql-hackers
On Tue, 12 Sep 2017 17:55:05 +0300
Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
> Attached rebased version of the patch. Added support of pg_dump, the
> code was simplified, and a separate cache for compression options was
> added.
>
Attached is version 3 of the patch. Rebased onto current master,
removed the ALTER TYPE .. SET COMPRESSED syntax, and fixed a bug in the
compression options cache.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment: custom_compression_methods_v3.patch (text/x-patch, 241.4 KB)
From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Custom compression methods
Date: 2017-11-02 15:02:34
Message-ID: CAMsr+YGm0z5571OmwyPaq94A03MaTfJq4PfR-r=uUpuidSTjeA@mail.gmail.com
Lists: pgsql-hackers
On 2 November 2017 at 17:41, Ildus Kurbangaliev
<i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> In this patch compression methods are suitable for MAIN and EXTENDED
> storage, like in the current implementation in postgres. Just instead
> of only pglz you can specify any other compression method.
We've had this discussion before.
Please read the "pluggable compression support" thread. See you in a
few days ;) sorry, it's kinda long.

https://www.postgresql.org/message-id/flat/20130621000900(dot)GA12425%40alap2(dot)anarazel(dot)de#20130621000900(dot)GA12425(at)alap2(dot)anarazel(dot)de
IIRC there were some concerns about what happened with pg_upgrade,
with consuming precious toast bits, and a few other things.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Custom compression methods
Date: 2017-11-05 19:22:53
Message-ID: CAF4Au4zp-S1srXbQD5XKHFaOky+vVY5dPAUgCi8Ziyp9sTTRWg@mail.gmail.com
Lists: pgsql-hackers
On Thu, Nov 2, 2017 at 6:02 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 2 November 2017 at 17:41, Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
>> In this patch compression methods are suitable for MAIN and EXTENDED
>> storage, like in the current implementation in postgres. Just instead
>> of only pglz you can specify any other compression method.
>
> We've had this discussion before.
>
> Please read the "pluggable compression support" thread. See you in a
> few days ;) sorry, it's kinda long.
>
> https://www.postgresql.org/message-id/flat/20130621000900(dot)GA12425%40alap2(dot)anarazel(dot)de#20130621000900(dot)GA12425(at)alap2(dot)anarazel(dot)de
>
The proposed patch provides "pluggable" compression and lets users
decide on their own which algorithm to use.
The postgres core isn't responsible for any patent problems.
> IIRC there were some concerns about what happened with pg_upgrade,
> with consuming precious toast bits, and a few other things.
yes, pg_upgrade may be a problem.
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Oleg Bartunov <obartunov(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Custom compression methods
Date: 2017-11-05 22:34:23
Message-ID: CA+TgmoZZYX-knZoTojo-dTbTvmRw8fzFQzH=MXs9GDN3m9-QeQ@mail.gmail.com
Lists: pgsql-hackers
On Sun, Nov 5, 2017 at 2:22 PM, Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:
>> IIRC there were some concerns about what happened with pg_upgrade,
>> with consuming precious toast bits, and a few other things.
>
> yes, pg_upgrade may be a problem.
A basic problem here is that, as proposed, DROP COMPRESSION METHOD may
break your database irretrievably. If there's no data compressed
using the compression method you dropped, everything is cool -
otherwise everything is broken and there's no way to recover. The
only obvious alternative is to disallow DROP altogether (or make it
not really DROP).
Both of those alternatives sound fairly unpleasant to me, but I'm not
exactly sure what to recommend in terms of how to make it better.
Ideally anything we expose as an SQL command should have a DROP
command that undoes whatever CREATE did and leaves the database in an
intact state, but that seems hard to achieve in this case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Adam Brusselback <adambrusselback(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Custom compression methods
Date: 2017-11-06 01:00:14
Message-ID: CAMjNa7cPynatwNDm9JQn2eJRu7Y5ARcw09Nbg5q8O_DfiY6=ww@mail.gmail.com
Lists: pgsql-hackers
> If there's no data compressed
> using the compression method you dropped, everything is cool -
> otherwise everything is broken and there's no way to recover.
> The only obvious alternative is to disallow DROP altogether (or make it
> not really DROP).
Wouldn't whatever was using the compression method have something
marking which method was used? If so, couldn't we just scan to see if
any data is using it and, if so, disallow the drop, or possibly provide
an option to allow the drop and rewrite the table either uncompressed
or with the default compression method?
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Custom compression methods
Date: 2017-11-06 03:32:08
Message-ID: [email protected]
Lists: pgsql-hackers
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> A basic problem here is that, as proposed, DROP COMPRESSION METHOD may
> break your database irretrievably. If there's no data compressed
> using the compression method you dropped, everything is cool -
> otherwise everything is broken and there's no way to recover. The
> only obvious alternative is to disallow DROP altogether (or make it
> not really DROP).
> Both of those alternatives sound fairly unpleasant to me, but I'm not
> exactly sure what to recommend in terms of how to make it better.
> Ideally anything we expose as an SQL command should have a DROP
> command that undoes whatever CREATE did and leaves the database in an
> intact state, but that seems hard to achieve in this case.
If the use of a compression method is tied to specific data types and/or
columns, then each of those could have a dependency on the compression
method, forcing a type or column drop if you did DROP COMPRESSION METHOD.
That would leave no reachable data using the removed compression method.
So that part doesn't seem unworkable on its face.
IIRC, the bigger concerns in the last discussion had to do with
replication, ie, can downstream servers make sense of the data.
Maybe that's not any worse than the issues you get with non-core
index AMs, but I'm not sure.
regards, tom lane
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Custom compression methods
Date: 2017-11-07 09:44:27
Message-ID: [email protected]
Lists: pgsql-hackers
On Thu, 2 Nov 2017 23:02:34 +0800
Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 2 November 2017 at 17:41, Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
> > In this patch compression methods are suitable for MAIN and EXTENDED
> > storage, like in the current implementation in postgres. Just instead
> > of only pglz you can specify any other compression method.
>
> We've had this discussion before.
>
> Please read the "pluggable compression support" thread. See you in a
> few days ;) sorry, it's kinda long.
>
> https://www.postgresql.org/message-id/flat/20130621000900(dot)GA12425%40alap2(dot)anarazel(dot)de#20130621000900(dot)GA12425(at)alap2(dot)anarazel(dot)de
>
> IIRC there were some concerns about what happened with pg_upgrade,
> with consuming precious toast bits, and a few other things.
>
Thank you for the link, I didn't see that thread when I looked over the
mailing lists. I read it briefly, and I can address a few things
relating to my patch.
Most concerns were related to legal issues. That was actually the reason
I did not include any new compression algorithms in my patch. Unlike
that patch, mine only provides the syntax and is just a way to let users
use their own compression algorithms and deal with any legal issues
themselves.
I use only one unused bit in the header (there's still one free ;);
that's enough to determine whether data is custom compressed or not.
I did find out that pg_upgrade doesn't work properly with my patch; I
will send a fix for it soon.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-14 13:23:56
Message-ID: [email protected]
Lists: pgsql-hackers
On Thu, 2 Nov 2017 15:28:36 +0300
Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> On Tue, 12 Sep 2017 17:55:05 +0300
> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
> >
> > Attached rebased version of the patch. Added support of pg_dump, the
> > code was simplified, and a separate cache for compression options
> > was added.
> >
>
> Attached version 3 of the patch. Rebased to the current master,
> removed ALTER TYPE .. SET COMPRESSED syntax, fixed bug in compression
> options cache.
>
Attached is version 4 of the patch. Fixed pg_upgrade and a few other
bugs.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment: custom_compression_methods_v4.patch (text/x-patch, 244.5 KB)
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-15 09:09:28
Message-ID: [email protected]
Lists: pgsql-hackers
On Sun, 5 Nov 2017 17:34:23 -0500
Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Nov 5, 2017 at 2:22 PM, Oleg Bartunov <obartunov(at)gmail(dot)com>
> wrote:
> >> IIRC there were some concerns about what happened with pg_upgrade,
> >> with consuming precious toast bits, and a few other things.
> >
> > yes, pg_upgrade may be a problem.
>
> A basic problem here is that, as proposed, DROP COMPRESSION METHOD may
> break your database irretrievably. If there's no data compressed
> using the compression method you dropped, everything is cool -
> otherwise everything is broken and there's no way to recover. The
> only obvious alternative is to disallow DROP altogether (or make it
> not really DROP).
In the patch I use a separate table for compression options (because
each attribute can have additional options for compression). So
basically a compressed attribute is linked to its compression options,
not to the compression method, and the method can be safely dropped.
So in the next version of the patch I can just unlink the options from
the compression methods, and dropping a compression method will not
affect already compressed tuples. They could still be decompressed.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-15 13:13:02
Message-ID: CA+TgmoachY0-FjthGkpEDTswt3i7h1xfMzjow25d5NNuvaaw7g@mail.gmail.com
Lists: pgsql-hackers
On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev
<i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> So in the next version of the patch I can just unlink the options from
> compression methods and dropping compression method will not affect
> already compressed tuples. They still could be decompressed.
I guess I don't understand how that can work. I mean, if somebody
removes a compression method - i.e. uninstalls the library - and you
don't have a way to make sure there are no tuples that can only be
uncompressed by that library - then you've broken the database.
Ideally, there should be a way to add a new compression method via an
extension ... and then get rid of it and all dependencies thereupon.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-16 15:09:37
Message-ID: [email protected]
Lists: pgsql-hackers
Hi Ildus,
On 14.11.2017 16:23, Ildus Kurbangaliev wrote:
> On Thu, 2 Nov 2017 15:28:36 +0300 Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
>> On Tue, 12 Sep 2017 17:55:05 +0300 Ildus Kurbangaliev
>> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>>
>>>
>>> Attached rebased version of the patch. Added support of pg_dump,
>>> the code was simplified, and a separate cache for compression
>>> options was added.
>>>
>>
>> Attached version 3 of the patch. Rebased to the current master,
>> removed ALTER TYPE .. SET COMPRESSED syntax, fixed bug in
>> compression options cache.
>>
>
> Attached version 4 of the patch. Fixed pg_upgrade and few other
> bugs.
>
I've started to review your code. And even though it's fine overall, I
have a few questions and comments (aside from the DROP COMPRESSION
METHOD discussion).
1. I'm not sure about proposed syntax for ALTER TABLE command:
>> ALTER TABLE t ALTER COLUMN a SET COMPRESSED <cmname> WITH (<options>);
>> ALTER TABLE t ALTER COLUMN a SET NOT COMPRESSED;
ISTM it is more common for Postgres to use syntax like SET/DROP for
column options (SET/DROP NOT NULL, DEFAULT etc). My suggestion would be:
ALTER TABLE t ALTER COLUMN a SET COMPRESSED USING <compression_method>
WITH (<options>);
ALTER TABLE t ALTER COLUMN a DROP COMPRESSED;
(keyword USING here is similar to "CREATE INDEX ... USING <method>" syntax)
2. The way you changed DefineRelation() implies that the caller is
responsible for creating the compression options. Probably it would be
better to create them within DefineRelation().
3. A few minor issues which look like obsolete code:
Function freeRelOptions() is defined but never used.
Function getBaseTypeTuple() has been extracted from
getBaseTypeAndTypmod() but is never used separately.
In toast_flatten_tuple_to_datum() there is an untoasted_value variable
which is only used for a meaningless assignment.
(Should I send a patch for these kinds of issues?)
--
Ildar Musin
i(dot)musin(at)postgrespro(dot)ru
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-19 23:04:53
Message-ID: [email protected]
Lists: pgsql-hackers
Hi,
On 11/14/2017 02:23 PM, Ildus Kurbangaliev wrote:
>
> ...
>
> Attached version 4 of the patch. Fixed pg_upgrade and few other bugs.
>
I did a review of this today, and I think there are some things that
need improvement / fixing.
Firstly, some basic comments from just eye-balling the diff, then some
bugs I discovered after writing an extension adding lz4.
1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar
already pointed that out)
2) There's unnecessary whitespace (extra newlines) on a couple of
places, which is needlessly increasing the size of the patch. Small
difference, but annoying.
3) tuptoaster.c
Why do you change 'info' from int32 to uint32? Seems unnecessary.
Adding a new 'att' variable in toast_insert_or_update is confusing, as
there already is 'att' in the very next loop. Technically it's correct,
but I'd bet it'll lead to some WTF?! moments later. I propose to just
use TupleDescAttr(tupleDesc,i) on the two places where it matters,
around line 808.
There are no comments for init_compression_options_htab and
get_compression_options_info, so that needs to be fixed. Moreover, the
names are confusing because what we really get is not just 'options' but
the compression routines too.
4) gen_db_file_maps probably shouldn't do the fprintf calls, right?
5) not sure why you modify src/tools/pgindent/exclude_file_patterns
6) I'm rather confused by AttributeCompression vs. ColumnCompression. I
mean, attribute==column, right? Of course, one is for data from the
parser, the other one is for internal info. But can we make the naming
clearer?
7) The docs in general are somewhat unsatisfactory, TBH. For example
ColumnCompression has no comments, unlike everything else in parsenodes.
Similarly for the SGML docs - I suggest expanding them to resemble the
FDW docs (https://www.postgresql.org/docs/10/static/fdwhandler.html),
which also follow the handler/routines pattern.
8) One of the unclear things is why we even need a 'drop' routine. It
seems that if it's defined, DropAttributeCompression does something. But
what should it do? I suppose dropping the options should be done using
dependencies (just like we drop columns in this case).
BTW why does DropAttributeCompression mess with att->attisdropped in
this way? That seems a bit odd.
9) configure routines that only check if (options != NIL) and then error
out (like tsvector_configure) seem a bit unnecessary. Just allow it to
be NULL in CompressionMethodRoutine, and throw an error if options is
not NIL for such compression method.
10) toast_compress_datum still does this:
if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
valsize > PGLZ_strategy_default->max_input_size))
which seems rather pglz-specific (the naming is a hint). Why shouldn't
this be specific to compression, exposed either as min/max constants, or
wrapped in another routine - size_is_valid() or something like that?
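A sketch of the per-method hook suggested here (the name follows the
sentence above; the signature is an assumption, and PGLZ_strategy_default
comes from pg_lzcompress.h):

static bool
pglz_size_is_valid(Size valsize)
{
    /* pglz only pays off inside this window; other methods could
     * declare different limits, or none at all */
    return (valsize >= PGLZ_strategy_default->min_input_size &&
            valsize <= PGLZ_strategy_default->max_input_size);
}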
11) The comments in toast_compress_datum probably need updating, as they
still reference pglz specifically. I guess the new compression methods
matter too.
12) get_compression_options_info organizes the compression info into a
hash table by OID. The hash table implementation assumes the hash key is
at the beginning of the entry, but AttributeCompression is defined like
this:
typedef struct
{
CompressionMethodRoutine *routine;
List *options;
Oid cmoptoid;
} AttributeCompression;
Which means get_compression_options_info is busted, will never look up
anything, and the hash table will grow by adding more and more entries
into the same bucket. Of course, this has extremely negative impact on
performance (pretty much arbitrarily bad, depending on how many entries
you've already added to the hash table).
Moving the OID to the beginning of the struct fixes the issue.
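That is, the fixed struct would look like this (same fields, hash key
first):

typedef struct
{
    Oid         cmoptoid;   /* hash key - must be the first field */
    CompressionMethodRoutine *routine;
    List       *options;
} AttributeCompression;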
13) When writing the experimental extension, I was extremely confused
about the regular varlena headers, custom compression headers, etc. In
the end I stole the code from tsvector.c and whacked it a bit until it
worked, but I wouldn't dare to claim I understand how it works.
This needs to be documented somewhere. For example postgres.h has a
bunch of paragraphs about varlena headers, so perhaps it should be
there? I see the patch tweaks some of the constants, but does not update
the comment at all.
Perhaps it would be useful to provide some additional macros making
access to custom-compressed varlena values easier. Or perhaps the
VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that? This
part is not very clear to me.
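For reference, the usual access pattern today uses the standard macros
from postgres.h/fmgr.h ('datum' stands for any Datum holding a varlena
value); whether these transparently handle the custom-compressed form is
exactly the open question:

/* PG_DETOAST_DATUM decompresses and/or fetches from TOAST as needed */
struct varlena *v = PG_DETOAST_DATUM(datum);

Size   rawlen = VARSIZE_ANY_EXHDR(v);   /* payload length, any header */
char  *data   = VARDATA_ANY(v);         /* payload bytes, any header */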
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: pg_lz4.tgz (application/x-compressed-tar, 1.5 KB)
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-19 23:23:23
Message-ID: [email protected]
Lists: pgsql-hackers
On 11/15/2017 02:13 PM, Robert Haas wrote:
> On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>> So in the next version of the patch I can just unlink the options from
>> compression methods and dropping compression method will not affect
>> already compressed tuples. They still could be decompressed.
>
> I guess I don't understand how that can work. I mean, if somebody
> removes a compression method - i.e. uninstalls the library - and you
> don't have a way to make sure there are no tuples that can only be
> uncompressed by that library - then you've broken the database.
> Ideally, there should be a way to add a new compression method via an
> extension ... and then get rid of it and all dependencies thereupon.
>
I share your confusion. Once you do DROP COMPRESSION METHOD, there must
be no remaining data compressed with it. But that's what the patch is
doing already - it enforces this using dependencies, as usual.
Ildus, can you explain what you meant? How could the data still be
decompressed after DROP COMPRESSION METHOD, and possibly after removing
the .so library?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-20 09:44:28
Message-ID: [email protected]
Lists: pgsql-hackers
On Mon, 20 Nov 2017 00:23:23 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 11/15/2017 02:13 PM, Robert Haas wrote:
> > On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev
> > <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> >> So in the next version of the patch I can just unlink the options
> >> from compression methods and dropping compression method will not
> >> affect already compressed tuples. They still could be
> >> decompressed.
> >
> > I guess I don't understand how that can work. I mean, if somebody
> > removes a compression method - i.e. uninstalls the library - and you
> > don't have a way to make sure there are no tuples that can only be
> > uncompressed by that library - then you've broken the database.
> > Ideally, there should be a way to add a new compression method via
> > an extension ... and then get rid of it and all dependencies
> > thereupon.
>
> I share your confusion. Once you do DROP COMPRESSION METHOD, there
> must be no remaining data compressed with it. But that's what the
> patch is doing already - it enforces this using dependencies, as
> usual.
>
> Ildus, can you explain what you meant? How could the data still be
> decompressed after DROP COMPRESSION METHOD, and possibly after
> removing the .so library?
The removal of the .so library will break all compressed tuples. I
don't see a way to avoid that. I meant that DROP COMPRESSION METHOD
could remove the record from the 'pg_compression' table, but actually a
compressed tuple needs only a record from 'pg_compression_opt', where
its options are located. And there is a dependency between an extension
and the options, so you can't just remove the extension without
CASCADE; postgres will complain.
Still, it's a problem if the user used for example `SELECT
<compressed_column> INTO * FROM *`, because postgres will copy the
compressed tuples, and there will not be any dependencies between the
destination and the options.
Also, thank you for the review. I will look into it today.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-20 15:18:30
Message-ID: [email protected]
Lists: pgsql-hackers
On 11/20/2017 10:44 AM, Ildus Kurbangaliev wrote:
> On Mon, 20 Nov 2017 00:23:23 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> On 11/15/2017 02:13 PM, Robert Haas wrote:
>>> On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev
>>> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>>>> So in the next version of the patch I can just unlink the options
>>>> from compression methods and dropping compression method will not
>>>> affect already compressed tuples. They still could be
>>>> decompressed.
>>>
>>> I guess I don't understand how that can work. I mean, if somebody
>>> removes a compression method - i.e. uninstalls the library - and you
>>> don't have a way to make sure there are no tuples that can only be
>>> uncompressed by that library - then you've broken the database.
>>> Ideally, there should be a way to add a new compression method via
>>> an extension ... and then get rid of it and all dependencies
>>> thereupon.
>>
>> I share your confusion. Once you do DROP COMPRESSION METHOD, there
>> must be no remaining data compressed with it. But that's what the
>> patch is doing already - it enforces this using dependencies, as
>> usual.
>>
>> Ildus, can you explain what you meant? How could the data still be
>> decompressed after DROP COMPRESSION METHOD, and possibly after
>> removing the .so library?
>
> The removal of the .so library will broke all compressed tuples. I
> don't see a way to avoid it. I meant that DROP COMPRESSION METHOD could
> remove the record from 'pg_compression' table, but actually the
> compressed tuple needs only a record from 'pg_compression_opt' where
> its options are located. And there is dependency between an extension
> and the options so you can't just remove the extension without CASCADE,
> postgres will complain.
>
I don't think we need to do anything smart here - it should behave just
like dropping a data type, for example. That is, error out if there are
columns using the compression method (without CASCADE), and drop all the
columns (with CASCADE).
Leaving the pg_compression_opt records around is not a solution. Not
only is it confusing, it also doesn't actually help, because the user is
likely to remove the .so file itself (perhaps not directly, but e.g. by
removing the rpm package providing it).
> Still it's a problem if the user used for example `SELECT
> <compressed_column> INTO * FROM *` because postgres will copy compressed
> tuples, and there will not be any dependencies between destination and
> the options.
>
This seems like a rather fatal design flaw, though. I'd say we need to
force recompression of the data in such cases. Otherwise all the
dependency tracking is rather pointless.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Евгений Шишкин <itparanoia(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-20 15:21:14
Message-ID: [email protected]
Lists: pgsql-hackers
> On Nov 20, 2017, at 18:18, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> I don't think we need to do anything smart here - it should behave just
> like dropping a data type, for example. That is, error out if there are
> columns using the compression method (without CASCADE), and drop all the
> columns (with CASCADE).
What if, instead of dropping the column, we leave the data uncompressed?
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Евгений Шишкин <itparanoia(at)gmail(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-20 15:29:11
Message-ID: [email protected]
Lists: pgsql-hackers
On 11/20/2017 04:21 PM, Евгений Шишкин wrote:
>
>
>> On Nov 20, 2017, at 18:18, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>
>> I don't think we need to do anything smart here - it should behave just
>> like dropping a data type, for example. That is, error out if there are
>> columns using the compression method (without CASCADE), and drop all the
>> columns (with CASCADE).
>
> What about instead of dropping column we leave data uncompressed?
>
That requires you to go through the data and rewrite the whole table.
And I'm not aware of a DROP command doing that; instead they just drop
the dependent objects (e.g. DROP TYPE, ...). So per POLS the DROP
COMPRESSION METHOD command should do that too.
But I'm wondering if ALTER COLUMN ... SET NOT COMPRESSED should do that
(currently it only disables compression for new data).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Евгений Шишкин <itparanoia(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-20 15:43:38
Message-ID: [email protected]
Lists: pgsql-hackers
> On Nov 20, 2017, at 18:29, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>>
>> What about instead of dropping column we leave data uncompressed?
>>
>
> That requires you to go through the data and rewrite the whole table.
> And I'm not aware of a DROP command doing that; instead they just drop
> the dependent objects (e.g. DROP TYPE, ...). So per POLS the DROP
> COMPRESSION METHOD command should do that too.
Well, there is not much you can do with DROP TYPE. But I'd argue that
compression is different. We do not drop data in the case of DROP
STATISTICS or DROP INDEX.
At least there should be a way to easily alter the compression method
then.
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-20 15:44:35
Message-ID: [email protected]
Lists: pgsql-hackers
On Mon, 20 Nov 2017 16:29:11 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 11/20/2017 04:21 PM, Евгений Шишкин wrote:
> >
> >
> >> On Nov 20, 2017, at 18:18, Tomas Vondra
> >> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >>
> >> I don't think we need to do anything smart here - it should behave
> >> just like dropping a data type, for example. That is, error out if
> >> there are columns using the compression method (without CASCADE),
> >> and drop all the columns (with CASCADE).
> >
> > What about instead of dropping column we leave data uncompressed?
> >
>
> That requires you to go through the data and rewrite the whole table.
> And I'm not aware of a DROP command doing that; instead they just drop
> the dependent objects (e.g. DROP TYPE, ...). So per POLS the DROP
> COMPRESSION METHOD command should do that too.
>
> But I'm wondering if ALTER COLUMN ... SET NOT COMPRESSED should do
> that (currently it only disables compression for new data).
If the table is big, decompression could take an eternity. That's why
I decided only to disable it; the data can still be decompressed using
the compression options.
My idea was to keep compression options forever, since there will not
be many of them in one database. Still, that requires that the
extension is not removed.
I will try to find a way to recompress the data first in case it moves
to another table.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Евгений Шишкин <itparanoia(at)gmail(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-20 16:31:45
Message-ID: [email protected]
Lists: pgsql-hackers
On 11/20/2017 04:43 PM, Евгений Шишкин wrote:
>
>
>> On Nov 20, 2017, at 18:29, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>>
>>> What about instead of dropping column we leave data uncompressed?
>>>
>>
>> That requires you to go through the data and rewrite the whole table.
>> And I'm not aware of a DROP command doing that; instead they just drop
>> the dependent objects (e.g. DROP TYPE, ...). So per POLS the DROP
>> COMPRESSION METHOD command should do that too.
>
> Well, there is no much you can do with DROP TYPE. But i'd argue that compression
> is different. We do not drop data in case of DROP STATISTICS or DROP INDEX.
>
But those DROP commands do not 'invalidate' data in the heap, so there's
no reason to drop the columns.
> At least there should be a way to easily alter compression method then.
>
+1
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-21 14:47:17
Message-ID: [email protected]
Lists: pgsql-hackers
On Mon, 20 Nov 2017 00:04:53 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I did a review of this today, and I think there are some things that
> need improvement / fixing.
>
> Firstly, some basic comments from just eye-balling the diff, then some
> bugs I discovered after writing an extension adding lz4.
>
> 1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar
> already pointer that out)
I removed freeRelOptions, but formatRelOptions is used in another place.
>
> 2) There's unnecessary whitespace (extra newlines) on a couple of
> places, which is needlessly increasing the size of the patch. Small
> difference, but annoying.
Cleaned up.
>
> 3) tuptoaster.c
>
> Why do you change 'info' from int32 to uint32? Seems unnecessary.
That's because I use the highest bit, and it makes the number negative
for int32. I use right shifting to get that bit, and a right shift on a
negative value gives a negative value too.
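A small illustration of the point (standard C behavior, not patch code):

#include <stdint.h>

static uint32_t
high_bit(uint32_t info)
{
    /* Unsigned right shift is well defined and yields 0 or 1 here.
     * With a signed int32_t, 0x80000000 is negative, and shifting a
     * negative value right is implementation-defined (commonly
     * arithmetic), giving -1 instead of 1. */
    return info >> 31;
}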
>
> Adding new 'att' variable in toast_insert_or_update is confusing, as
> there already is 'att' in the very next loop. Technically it's
> correct, but I'd bet it'll lead to some WTF?! moments later. I
> propose to just use TupleDescAttr(tupleDesc,i) on the two places
> where it matters, around line 808.
>
> There are no comments for init_compression_options_htab and
> get_compression_options_info, so that needs to be fixed. Moreover, the
> names are confusing because what we really get is not just 'options'
> but the compression routines too.
Removed extra 'att', and added comments.
>
> 4) gen_db_file_maps probably shouldn't do the fprints, right?
>
> 5) not sure why you modify src/tools/pgindent/exclude_file_patterns
My bad, removed these lines.
>
> 6) I'm rather confused by AttributeCompression vs. ColumnCompression.
> I mean, attribute==column, right? Of course, one is for data from
> parser, the other one is for internal info. But can we make the
> naming clearer?
For now I have renamed AttributeCompression to CompressionOptions; not
sure that's a good name, but at least it causes less confusion.
>
> 7) The docs in general are somewhat unsatisfactory, TBH. For example
> the ColumnCompression has no comments, unlike everything else in
> parsenodes. Similarly for the SGML docs - I suggest to expand them to
> resemble FDW docs
> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
> also follows the handler/routines pattern.
I've added more comments. I think I'll add more documentation once the
committers approve the current syntax.
>
> 8) One of the unclear things is why we even need the 'drop' routine. It
> seems that if it's defined DropAttributeCompression does something.
> But what should it do? I suppose dropping the options should be done
> using dependencies (just like we drop columns in this case).
>
> BTW why does DropAttributeCompression mess with att->attisdropped in
> this way? That seems a bit odd.
The 'drop' routine could be useful. An extension could do something
related to the attribute, like removing auxiliary tables or something
else. The compression options will not be removed after unlinking the
compression method from a column, because compressed data may still be
stored in that column.
That 'attisdropped' part has been removed.
>
> 9) configure routines that only check if (options != NIL) and then
> error out (like tsvector_configure) seem a bit unnecessary. Just
> allow it to be NULL in CompressionMethodRoutine, and throw an error
> if options is not NIL for such compression method.
Good idea, done.
>
> 10) toast_compress_datum still does this:
>
> if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
> valsize > PGLZ_strategy_default->max_input_size))
>
> which seems rather pglz-specific (the naming is a hint). Why shouldn't
> this be specific to compression, exposed either as min/max constants,
> or wrapped in another routine - size_is_valid() or something like
> that?
I agree, moved to the next block related to pglz.
>
> 11) The comments in toast_compress_datum probably need updating, as it
> still references to pglz specifically. I guess the new compression
> methods do matter too.
Done.
>
> 12) get_compression_options_info organizes the compression info into a
> hash table by OID. The hash table implementation assumes the hash key
> is at the beginning of the entry, but AttributeCompression is defined
> like this:
>
> typedef struct
> {
> CompressionMethodRoutine *routine;
> List *options;
> Oid cmoptoid;
> } AttributeCompression;
>
> Which means get_compression_options_info is busted, will never lookup
> anything, and the hash table will grow by adding more and more entries
> into the same bucket. Of course, this has extremely negative impact on
> performance (pretty much arbitrarily bad, depending on how many
> entries you've already added to the hash table).
>
> Moving the OID to the beginning of the struct fixes the issue.
Yeah, I fixed it before, but somehow managed not to include it in the
patch.
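For reference, a minimal sketch of the fixed layout, together with the
kind of hash_create() call it pairs with (the table name and sizes here
are illustrative, not the exact patch code):

/* dynahash hashes the first `keysize` bytes of an entry, so the
 * lookup key must be the first field of the struct */
typedef struct
{
    Oid         cmoptoid;       /* hash key -- must be first */
    CompressionMethodRoutine *routine;
    List       *options;
} AttributeCompression;

HTAB       *htab;
HASHCTL     ctl;

MemSet(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(AttributeCompression);

htab = hash_create("compression options cache", 64, &ctl,
                   HASH_ELEM | HASH_BLOBS);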
>
> 13) When writing the experimental extension, I was extremely confused
> about the regular varlena headers, custom compression headers, etc. In
> the end I stole the code from tsvector.c and whacked it a bit until it
> worked, but I wouldn't dare to claim I understand how it works.
>
> This needs to be documented somewhere. For example postgres.h has a
> bunch of paragraphs about varlena headers, so perhaps it should be
> there? I see the patch tweaks some of the constants, but does not
> update the comment at all.
This is a good point; I'm not sure what this documentation should look
like. I've just assumed that people would have a deep understanding of
varlenas if they're going to compress them. But as it stands it's easy to
make a mistake there. Maybe I should add some functions that help to
construct varlenas with different headers. I like the way jsonb is
constructed. It uses StringInfo and there are a few helper functions
(reserveFromBuffer, appendToBuffer and others). Maybe they should no
longer be static.
>
> Perhaps it would be useful to provide some additional macros making
> access to custom-compressed varlena values easier. Or perhaps the
> VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that?
> This part is not very clear to me.
These macros will work; custom compressed varlenas behave like the old
compressed varlenas.
> > Still it's a problem if the user used for example `SELECT
> > <compressed_column> INTO * FROM *` because postgres will copy
> > compressed tuples, and there will not be any dependencies between
> > destination and the options.
> >
>
> This seems like a rather fatal design flaw, though. I'd say we need to
> force recompression of the data, in such cases. Otherwise all the
> dependency tracking is rather pointless.
Fixed this problem too. I've added recompression for datums that use
custom compression.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment: custom_compression_methods_v5.patch (text/x-patch, 251.8 KB)
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-21 17:47:49
Message-ID: [email protected]
Hi,
On 11/21/2017 03:47 PM, Ildus Kurbangaliev wrote:
> On Mon, 20 Nov 2017 00:04:53 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> ...
>
>> 6) I'm rather confused by AttributeCompression vs.
>> ColumnCompression. I mean, attribute==column, right? Of course, one
>> is for data from parser, the other one is for internal info. But
>> can we make the naming clearer?
>
> For now I have renamed AttributeCompression to CompressionOptions;
> not sure that's a good name, but at least it's less confusing.
>
I propose to use either
CompressionMethodOptions (and CompressionMethodRoutine)
or
CompressionOptions (and CompressionRoutine)
>>
>> 7) The docs in general are somewhat unsatisfactory, TBH. For example
>> the ColumnCompression has no comments, unlike everything else in
>> parsenodes. Similarly for the SGML docs - I suggest to expand them to
>> resemble FDW docs
>> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
>> also follows the handler/routines pattern.
>
> I've added more comments. I think I'll add more documentation if the
> committers will approve current syntax.
>
OK. Haven't reviewed this yet.
>>
>> 8) One of the unclear things is why we even need the 'drop' routine. It
>> seems that if it's defined DropAttributeCompression does something.
>> But what should it do? I suppose dropping the options should be done
>> using dependencies (just like we drop columns in this case).
>>
>> BTW why does DropAttributeCompression mess with att->attisdropped in
>> this way? That seems a bit odd.
>
> The 'drop' routine could be useful. An extension could do something
> related to the attribute, like removing auxiliary tables or something
> else. The compression options will not be removed after unlinking the
> compression method from a column, because compressed data may still be
> stored in that column.
>
OK. So something like a "global" dictionary used for the column, or
something like that? Sure, seems useful and I've been thinking about
that, but I think we badly need some extension using that, even if in a
very simple way. Firstly, we need a "how to" example, secondly we need
some way to test it.
>>
>> 13) When writing the experimental extension, I was extremely
>> confused about the regular varlena headers, custom compression
>> headers, etc. In the end I stole the code from tsvector.c and
>> whacked it a bit until it worked, but I wouldn't dare to claim I
>> understand how it works.
>>
>> This needs to be documented somewhere. For example postgres.h has
>> a bunch of paragraphs about varlena headers, so perhaps it should
>> be there? I see the patch tweaks some of the constants, but does
>> not update the comment at all.
>
> This is a good point; I'm not sure what this documentation should look
> like. I've just assumed that people would have a deep understanding of
> varlenas if they're going to compress them. But as it stands it's easy
> to make a mistake there. Maybe I should add some functions that help to
> construct varlenas with different headers. I like the way jsonb is
> constructed. It uses StringInfo and there are a few helper functions
> (reserveFromBuffer, appendToBuffer and others). Maybe they should no
> longer be static.
>
Not sure. My main problem was not understanding how this affects the
varlena header, etc. And I had no idea where to look.
>>
>> Perhaps it would be useful to provide some additional macros
>> making access to custom-compressed varlena values easier. Or
>> perhaps the VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already
>> support that? This part is not very clear to me.
>
> These macros will work; custom compressed varlenas behave like the old
> compressed varlenas.
>
OK. But then I don't understand why tsvector.c does things like
VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
VARRAWSIZE_4B_C(data) - arrsize
instead of
VARSIZE_ANY_EXHDR(data) - arrsize
VARSIZE_ANY(data) - arrsize
Seems somewhat confusing.
>>> Still it's a problem if the user used for example `SELECT
>>> <compressed_column> INTO * FROM *` because postgres will copy
>>> compressed tuples, and there will not be any dependencies
>>> between destination and the options.
>>>
>>
>> This seems like a rather fatal design flaw, though. I'd say we need
>> to force recompression of the data, in such cases. Otherwise all
>> the dependency tracking is rather pointless.
>
> Fixed this problem too. I've added recompression for datum that use
> custom compression.
>
Hmmm, it still doesn't work for me. See this:
test=# create extension pg_lz4 ;
CREATE EXTENSION
test=# create table t_lz4 (v text compressed lz4);
CREATE TABLE
test=# create table t_pglz (v text);
CREATE TABLE
test=# insert into t_lz4 select repeat(md5(1::text),300);
INSERT 0 1
test=# insert into t_pglz select * from t_lz4;
INSERT 0 1
test=# drop extension pg_lz4 cascade;
NOTICE: drop cascades to 2 other objects
DETAIL: drop cascades to compression options for lz4
drop cascades to table t_lz4 column v
DROP EXTENSION
test=# \c test
You are now connected to database "test" as user "user".
test=# insert into t_lz4 select repeat(md5(1::text),300);^C
test=# select * from t_pglz ;
ERROR: cache lookup failed for compression options 16419
That suggests no recompression happened.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ildus K <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-21 20:28:55
Message-ID: [email protected]
On Tue, 21 Nov 2017 18:47:49 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I propose to use either
>
> CompressionMethodOptions (and CompressionMethodRoutine)
>
> or
>
> CompressionOptions (and CompressionRoutine)
Sounds good, thanks.
>
> OK. But then I don't understand why tsvector.c does things like
>
> VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
> VARRAWSIZE_4B_C(data) - arrsize
>
> instead of
>
> VARSIZE_ANY_EXHDR(data) - arrsize
> VARSIZE_ANY(data) - arrsize
>
> Seems somewhat confusing.
>
VARRAWSIZE_4B_C returns the original size of the data before compression
(from va_rawsize in current postgres, and from va_info in my patch), not
the size of the already compressed data, so you can't use VARSIZE_ANY
here. VARSIZE_ANY_EXHDR in current postgres returns VARSIZE - VARHDRSZ
whether or not the varlena is compressed, so I just kept this behavior
for custom compressed varlenas too. If you look into tuptoaster.c you
will also see lines like 'VARSIZE(attr) - TOAST_COMPRESS_HDRSZ'. So I
think if VARSIZE_ANY_EXHDR is to subtract different header sizes, then it
should subtract them for the usual compressed varlenas too.
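To summarize the size accounting (a quick sketch; VARHDRSZ_CUSTOM_COMPRESSED
and VARRAWSIZE_4B_C exist only in this patch, the other macros are standard
postgres.h, and `arrsize` stands for the part of the value stored
uncompressed, as in the tsvector code):

/*
 * VARSIZE(data)            - total stored size, including the header
 * VARSIZE_ANY_EXHDR(data)  - stored size minus the usual 4-byte header
 * VARRAWSIZE_4B_C(data)    - original size of the data before compression
 */
Size    compressed_payload = VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize;
Size    original_payload   = VARRAWSIZE_4B_C(data) - arrsize;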
> >
>
> Hmmm, it still doesn't work for me. See this:
>
> test=# create extension pg_lz4 ;
> CREATE EXTENSION
> test=# create table t_lz4 (v text compressed lz4);
> CREATE TABLE
> test=# create table t_pglz (v text);
> CREATE TABLE
> test=# insert into t_lz4 select repeat(md5(1::text),300);
> INSERT 0 1
> test=# insert into t_pglz select * from t_lz4;
> INSERT 0 1
> test=# drop extension pg_lz4 cascade;
> NOTICE: drop cascades to 2 other objects
> DETAIL: drop cascades to compression options for lz4
> drop cascades to table t_lz4 column v
> DROP EXTENSION
> test=# \c test
> You are now connected to database "test" as user "user".
> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
> test=# select * from t_pglz ;
> ERROR: cache lookup failed for compression options 16419
>
> That suggests no recompression happened.
I will check that. Is your extension published somewhere?
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus K <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-21 20:42:37
Message-ID: [email protected]
On 11/21/2017 09:28 PM, Ildus K wrote:
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE: drop cascades to 2 other objects
>> DETAIL: drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR: cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
>
> I will check that. Is your extension published somewhere?
>
No, it was just an experiment, so I've only attached it to the initial
review. Attached is an updated version, with a fix or two.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: pg_lz4.tgz (application/x-compressed-tar, 1.5 KB)
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-23 09:38:49
Message-ID: [email protected]
On Tue, 21 Nov 2017 18:47:49 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
>
> Hmmm, it still doesn't work for me. See this:
>
> test=# create extension pg_lz4 ;
> CREATE EXTENSION
> test=# create table t_lz4 (v text compressed lz4);
> CREATE TABLE
> test=# create table t_pglz (v text);
> CREATE TABLE
> test=# insert into t_lz4 select repeat(md5(1::text),300);
> INSERT 0 1
> test=# insert into t_pglz select * from t_lz4;
> INSERT 0 1
> test=# drop extension pg_lz4 cascade;
> NOTICE: drop cascades to 2 other objects
> DETAIL: drop cascades to compression options for lz4
> drop cascades to table t_lz4 column v
> DROP EXTENSION
> test=# \c test
> You are now connected to database "test" as user "user".
> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
> test=# select * from t_pglz ;
> ERROR: cache lookup failed for compression options 16419
>
> That suggests no recompression happened.
Should be fixed in the attached patch. I've changed your extension a
little bit according to the changes in the new patch (also attached).
Also I renamed a few functions, added more comments and simplified the
code related to DefineRelation (thanks to Ildar Musin's suggestion).
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
  custom_compression_methods_v6.patch (text/x-patch, 256.4 KB)
  pg_lz4.tar.gz (application/gzip, 1.4 KB)
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-23 20:54:32
Message-ID: [email protected]
Hi,
On 11/23/2017 10:38 AM, Ildus Kurbangaliev wrote:
> On Tue, 21 Nov 2017 18:47:49 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>>>
>>
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE: drop cascades to 2 other objects
>> DETAIL: drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR: cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
>
> Should be fixed in the attached patch. I've changed your extension a
> little bit according changes in the new patch (also in attachments).
>
Hmm, this seems to have fixed it, but only in one direction. Consider this:
create table t_pglz (v text);
create table t_lz4 (v text compressed lz4);
insert into t_pglz select repeat(md5(i::text),300)
from generate_series(1,100000) s(i);
insert into t_lz4 select repeat(md5(i::text),300)
from generate_series(1,100000) s(i);
\d+
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | t_lz4 | table | user | 12 MB |
public | t_pglz | table | user | 18 MB |
(2 rows)
truncate t_pglz;
insert into t_pglz select * from t_lz4;
\d+
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | t_lz4 | table | user | 12 MB |
public | t_pglz | table | user | 18 MB |
(2 rows)
which is fine. But in the other direction, this happens:
truncate t_lz4;
insert into t_lz4 select * from t_pglz;
\d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | t_lz4 | table | user | 18 MB |
public | t_pglz | table | user | 18 MB |
(2 rows)
which means the data is still pglz-compressed. That's rather strange, I
guess, and it should compress the data using the compression method set
for the target table instead.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-24 09:38:00
Message-ID: [email protected]
On Thu, 23 Nov 2017 21:54:32 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hmm, this seems to have fixed it, but only in one direction. Consider
> this:
>
> create table t_pglz (v text);
> create table t_lz4 (v text compressed lz4);
>
> insert into t_pglz select repeat(md5(i::text),300)
> from generate_series(1,100000) s(i);
>
> insert into t_lz4 select repeat(md5(i::text),300)
> from generate_series(1,100000) s(i);
>
> \d+
>
> Schema | Name | Type | Owner | Size | Description
> --------+--------+-------+-------+-------+-------------
> public | t_lz4 | table | user | 12 MB |
> public | t_pglz | table | user | 18 MB |
> (2 rows)
>
> truncate t_pglz;
> insert into t_pglz select * from t_lz4;
>
> \d+
>
> Schema | Name | Type | Owner | Size | Description
> --------+--------+-------+-------+-------+-------------
> public | t_lz4 | table | user | 12 MB |
> public | t_pglz | table | user | 18 MB |
> (2 rows)
>
> which is fine. But in the other direction, this happens
>
> truncate t_lz4;
> insert into t_lz4 select * from t_pglz;
>
> \d+
> List of relations
> Schema | Name | Type | Owner | Size | Description
> --------+--------+-------+-------+-------+-------------
> public | t_lz4 | table | user | 18 MB |
> public | t_pglz | table | user | 18 MB |
> (2 rows)
>
> which means the data is still pglz-compressed. That's rather strange,
> I guess, and it should compress the data using the compression method
> set for the target table instead.
That's actually an interesting issue. It happens because if a tuple fits
into a page, postgres just moves it as is. I've only added recompression
when the tuple has custom compressed datums, to keep the dependencies
right. But look:
create table t1(a text);
create table t2(a text);
alter table t2 alter column a set storage external;
insert into t1 select repeat(md5(i::text),300) from
generate_series(1,100000) s(i);
\d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+------------+-------------
public | t1 | table | ildus | 18 MB |
public | t2 | table | ildus | 8192 bytes |
(2 rows)
insert into t2 select * from t1;
\d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+-------+-------------
public | t1 | table | ildus | 18 MB |
public | t2 | table | ildus | 18 MB |
(2 rows)
That means there are now compressed datums in a column whose storage is
specified as external. I'm not sure whether that's a bug or a feature.
Let's insert them the usual way:
delete from t2;
insert into t2 select repeat(md5(i::text),300) from
generate_series(1,100000) s(i);
\d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+---------+-------------
public | t1 | table | ildus | 18 MB |
public | t2 | table | ildus | 1011 MB |
Maybe there should be a more general solution, like a comparison of
attribute properties?
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-25 01:13:41
Message-ID: [email protected]
Hi,
On 11/24/2017 10:38 AM, Ildus Kurbangaliev wrote:
> ...
> That means there are now compressed datums in a column whose storage
> is specified as external. I'm not sure whether that's a bug or a feature.
>
Interesting. Never realized it behaves like this. Not sure if it's
intentional or not (i.e. bug vs. feature).
> Lets insert them usual way:
>
> delete from t2;
> insert into t2 select repeat(md5(i::text),300) from
> generate_series(1,100000) s(i);
> \d+
>
> List of relations
> Schema | Name | Type | Owner | Size | Description
> --------+------+-------+-------+---------+-------------
> public | t1 | table | ildus | 18 MB |
> public | t2 | table | ildus | 1011 MB |
>
> Maybe there should be more common solution like comparison of
> attribute properties?
>
Maybe, not sure what the right solution is. I just know that if we allow
inserting data into arbitrary tables without recompression, we may end
up with data that can't be decompressed.
I agree that the behavior with extended storage is somewhat similar, but
the important distinction is that, while that is surprising, the data is
still accessible.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-25 05:40:00
Message-ID: [email protected]
Hi,
I ran into another issue - after inserting some data into a table with a
tsvector column (without any compression defined), I can no longer read
the data.
This is what I get in the console:
db=# select max(md5(body_tsvector::text)) from messages;
ERROR: cache lookup failed for compression options 6432
and the stack trace looks like this:
Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at
tuptoaster.c:2563
2563 elog(ERROR, "cache lookup failed for compression options %u",
cmoptoid);
(gdb) bt
#0 get_cached_compression_options (cmoptoid=6432) at tuptoaster.c:2563
#1 0x00000000004bf3da in toast_decompress_datum (attr=0x2b44148) at
tuptoaster.c:2390
#2 0x00000000004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at
tuptoaster.c:225
#3 0x000000000083f976 in pg_detoast_datum (datum=<optimized out>) at
fmgr.c:1829
#4 0x00000000008072de in tsvectorout (fcinfo=0x2b41e00) at tsvector.c:315
#5 0x00000000005fae00 in ExecInterpExpr (state=0x2b414b8,
econtext=0x2b25ab0, isnull=<optimized out>) at execExprInterp.c:1131
#6 0x000000000060bdf4 in ExecEvalExprSwitchContext
(isNull=0x7fffffe9bd37 "", econtext=0x2b25ab0, state=0x2b414b8) at
../../../src/include/executor/executor.h:299
It seems the VARATT_IS_CUSTOM_COMPRESSED incorrectly identifies the
value as custom-compressed for some reason.
Not sure why, but the tsvector column is populated by a trigger that
simply does
NEW.body_tsvector
:= to_tsvector('english', strip_replies(NEW.body_plain));
If needed, the complete tool is here:
https://bitbucket.org/tvondra/archie
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-27 15:52:18
Message-ID: [email protected]
On Sat, 25 Nov 2017 06:40:00 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> I ran into another issue - after inserting some data into a table
> with a tsvector column (without any compression defined), I can no
> longer read the data.
>
> This is what I get in the console:
>
> db=# select max(md5(body_tsvector::text)) from messages;
> ERROR: cache lookup failed for compression options 6432
>
> and the stack trace looks like this:
>
> Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at
> tuptoaster.c:2563
> 2563 elog(ERROR, "cache lookup failed for
> compression options %u", cmoptoid);
> (gdb) bt
> #0 get_cached_compression_options (cmoptoid=6432) at
> tuptoaster.c:2563 #1 0x00000000004bf3da in toast_decompress_datum
> (attr=0x2b44148) at tuptoaster.c:2390
> #2 0x00000000004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at
> tuptoaster.c:225
> #3 0x000000000083f976 in pg_detoast_datum (datum=<optimized out>) at
> fmgr.c:1829
> #4 0x00000000008072de in tsvectorout (fcinfo=0x2b41e00) at
> tsvector.c:315 #5 0x00000000005fae00 in ExecInterpExpr
> (state=0x2b414b8, econtext=0x2b25ab0, isnull=<optimized out>) at
> execExprInterp.c:1131 #6 0x000000000060bdf4 in
> ExecEvalExprSwitchContext (isNull=0x7fffffe9bd37 "",
> econtext=0x2b25ab0, state=0x2b414b8)
> at ../../../src/include/executor/executor.h:299
>
> It seems the VARATT_IS_CUSTOM_COMPRESSED incorrectly identifies the
> value as custom-compressed for some reason.
>
> Not sure why, but the tsvector column is populated by a trigger that
> simply does
>
> NEW.body_tsvector
> := to_tsvector('english', strip_replies(NEW.body_plain));
>
> If needed, the complete tool is here:
>
> https://bitbucket.org/tvondra/archie
>
Hi. This looks like a serious bug, but I couldn't reproduce it yet. Did
you upgrade some old database, or did this bug happen after inserting all
the data into a new database? I tried using your 'archie' tool to
download mailing lists and insert them into a database, but couldn't
catch any errors.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-27 17:20:12
Message-ID: [email protected]
Hi,
On 11/27/2017 04:52 PM, Ildus Kurbangaliev wrote:
> ...
>
> Hi. This looks like a serious bug, but I couldn't reproduce it yet.
> Did you upgrade some old database, or did this bug happen after
> inserting all the data into a new database? I tried using your 'archie'
> tool to download mailing lists and insert them into a database, but
> couldn't catch any errors.
>
I can trigger it pretty reliably with these steps:
git checkout f65d21b258085bdc8ef2cc282ab1ff12da9c595c
patch -p1 < ~/custom_compression_methods_v6.patch
./configure --enable-debug --enable-cassert \
CFLAGS="-fno-omit-frame-pointer -O0 -DRANDOMIZE_ALLOCATED_MEMORY" \
--prefix=/home/postgres/pg-compress
make -s clean && make -s -j4 install
cd contrib/
make -s clean && make -s -j4 install
export PATH=/home/postgres/pg-compress/bin:$PATH
pg_ctl -D /mnt/raid/pg-compress init
pg_ctl -D /mnt/raid/pg-compress -l compress.log start
createdb archie
cd ~/archie/sql/
psql archie < create.sql
~/archie/bin/load.py --workers 4 --db archie */* > load.log 2>&1
I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
without it, and it seemed to work fine). If that's the case, I bet there
is a palloc that should have been palloc0, or something like that.
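For context, that define makes palloc fill each allocated chunk with
random bytes, so any field the code forgets to initialize turns into
garbage that changes between runs. A generic sketch of the failure mode
(SomeHeader is a stand-in, not a type from the patch):

/* a stand-in struct, not a type from the patch */
typedef struct { uint32 info; } SomeHeader;

/* with -DRANDOMIZE_ALLOCATED_MEMORY the returned chunk is filled with
 * random bytes, so hdr->info starts out as run-to-run garbage */
SomeHeader *hdr = (SomeHeader *) palloc(sizeof(SomeHeader));

/* if hdr->info is never assigned, a later flag test such as
 * VARATT_IS_CUSTOM_COMPRESSED() sees random bits and may pass */

/* palloc0 zeroes the chunk, which hides a missing assignment */
SomeHeader *hdr2 = (SomeHeader *) palloc0(sizeof(SomeHeader));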
If you still can't reproduce that, I may give you access to this machine
so that you can debug it there.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-28 13:29:18
Message-ID: [email protected]
On Mon, 27 Nov 2017 18:20:12 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
> without it, and it seemed to work fine). If that's the case, I bet
> there is a palloc that should have been palloc0, or something like
> that.
Thanks, that was it. I've been able to reproduce this bug. The attached
patch should fix it, and I've also added recompression for tuples moved
to a relation with a compressed attribute.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment: custom_compression_methods_v7.patch (text/x-patch, 260.0 KB)
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-29 23:30:37
Message-ID: [email protected]
On 11/28/2017 02:29 PM, Ildus Kurbangaliev wrote:
> On Mon, 27 Nov 2017 18:20:12 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first
>> tried without it, and it seemed to work fine). If that's the case,
>> I bet there is a palloc that should have been palloc0, or something
>> like that.
>
> Thanks, that was it. I've been able to reproduce this bug. The
> attached patch should fix it, and I've also added recompression for
> tuples moved to a relation with a compressed attribute.
>
I've done many tests with fulltext search on the mail archive, using
different compression algorithms, and this time it worked fine. So I can
confirm v7 fixes the issue.
Let me elaborate a bit about the benchmarking I did. I realize the patch
is meant to provide only an "API" for custom compression methods, and so
benchmarking of existing general-purpose algorithms (replacing the
built-in pglz) may seem a bit irrelevant. But I'll draw some conclusions
from that, so please bear with me. Or just skip the next section.
------------------ benchmark / start ------------------
I was curious how much better we could do than the built-in compression,
so I've whipped together a bunch of extensions for a few common
general-purpose compression algorithms (lz4, gz, bz2, zstd, brotli and
snappy), loaded the community mailing list archives using "archie" [1]
and ran a bunch of real-world full-text queries on it. I've used
"default" (or "medium") compression levels for all algorithms.
For the loads, the results look like this:
            seconds     size
-----------------------------
pglz           1631     9786
zstd           1844     7102
lz4            1582     9537
bz2            2382     7670
gz             1703     7067
snappy         1587    12288
brotli        10973     6180
According to those results the algorithms seem quite comparable, with
the exception of snappy and brotli. Snappy supposedly aims for fast
compression rather than compression ratio, but it's about as fast as the
other algorithms while its compression ratio is almost 2x worse. Brotli
is much slower, although it gets a better compression ratio.
For the queries, I ran about 33k real-world queries (executed on the
community mailing lists in the past). Firstly, a simple
-- unsorted
SELECT COUNT(id) FROM messages WHERE body_tsvector @@ $1::tsquery
and then
-- sorted
SELECT id FROM messages WHERE body_tsvector @@ $1::tsquery
ORDER BY ts_rank(body_tsvector, $1::tsquery) DESC LIMIT 100;
Attached are 4 different charts, plotting pglz on x-axis and the other
algorithms on y-axis (so below diagonal => new algorithm is faster,
above diagonal => pglz is faster). I did this on two different machines,
one with only 8GB of RAM (so the dataset does not fit) and one much
larger (so everything fits into RAM).
I'm actually surprised how well the built-in pglz compression fares,
both on compression ratio and (de)compression speed. There is a bit of
noise for the fastest queries, where the alternative algorithms perform
better in a non-trivial number of cases.
I suspect those cases may be due to not implementing anything like
PGLZ_strategy_default->min_comp_rate (requiring 25% size reduction), but
I'm not sure about it.
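For reference, the pglz acceptance test looks roughly like this
(simplified from pg_lzcompress.c, not the exact source):

/* simplified: accept the compressed result only if it is at least
 * min_comp_rate percent smaller than the input */
const PGLZ_Strategy *strategy = PGLZ_strategy_default;

int32   need_rate  = strategy->min_comp_rate;   /* 25 in the default strategy */
int32   result_max = (slen * (100 - need_rate)) / 100;

if (compressed_size > result_max)
    return false;               /* keep the value uncompressed */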
For more expensive queries, pglz pretty much wins. Of course, increasing
the compression level might change the results a bit, but it will also make
the data loads slower.
------------------ benchmark / end ------------------
While the results may look different for other datasets, my conclusion
is that it's unlikely we'll find another general-purpose algorithm
beating pglz (enough for people to switch to it, as they'll need to
worry about testing, deployment of extensions etc).
That doesn't necessarily mean supporting custom compression algorithms
is pointless, of course, but I think people will be much more interested
in exploiting known features of the data (instead of treating the values
as opaque arrays of bytes).
For example, I see the patch implements a special compression method for
tsvector values (used in the tests), exploiting knowledge of its
internal structure. I haven't tested whether that is an improvement (either
in compression/decompression speed or compression ratio), though.
I can imagine other interesting use cases - for example values in JSONB
columns often use the same "schema" (keys, nesting, ...), so I can
imagine building a "dictionary" of JSON keys for the whole column ...
Ildus, is this a use case you've been aiming for, or were you aiming to
use the new API in a different way?
I wonder if the patch can be improved to handle this use case better.
For example, it requires knowledge of the actual data type, instead of
treating it as opaque varlena / byte array. I see tsvector compression
does that by checking typeid in the handler.
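Presumably the handler does something along these lines (my
reconstruction, not the patch's exact code):

/* reconstruction of the type check; getBaseType() from
 * utils/lsyscache.h resolves a domain to its base type and would
 * likely fix the failure shown below */
if (typeid != TSVECTOROID)      /* a domain over tsvector has its own OID */
    elog(ERROR, "unexpected type %u for tsvector compression handler",
         typeid);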
But that fails for example with this example
db=# create domain x as tsvector;
CREATE DOMAIN
db=# create table t (a x compressed ts1);
ERROR: unexpected type 28198672 for tsvector compression handler
which means it's a few bricks shy of properly supporting domains. But I
wonder if this should instead be specified in CREATE COMPRESSION METHOD.
I mean, something like
CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
TYPE tsvector;
When no type is specified, it applies to all varlena values; otherwise
only to that type. Also, why not allow setting the compression as the
default method for a data type, e.g.
CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
TYPE tsvector DEFAULT;
would automatically add 'COMPRESSED ts1' to all tsvector columns in new
CREATE TABLE commands.
BTW do you expect the tsvector compression to be generally useful, or is
it meant to be used only by the tests? If generally useful, perhaps it
should be created in pg_compression by default. If only for tests, maybe
it should be implemented in an extension in contrib (thus also serving
as an example of how to implement new methods).
I haven't thought about the JSONB use case very much, but I suppose that
could be done using the configure/drop methods. I mean, allocating the
dictionary somewhere (e.g. in a table created by an extension?). The
configure method gets the Form_pg_attribute record, so that should be
enough I guess.
But the patch is not testing those two methods at all, which seems like
something that needs to be addressed before commit. I don't expect a
full-fledged JSONB compression extension, but something simple that
actually exercises those methods in a meaningful way.
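Even something as small as this would do - a sketch of a dummy handler
whose configure/drop callbacks merely prove they were called (the
callback signatures and field names are my guesses from this thread, not
necessarily the patch's exact API):

PG_FUNCTION_INFO_V1(dummy_compression_handler);

static void
dummy_configure(Form_pg_attribute att, List *options)
{
    /* just prove the callback ran; a real method would validate the
     * options or set up auxiliary state (e.g. a dictionary table) here */
    elog(NOTICE, "dummy compression method configured");
}

static void
dummy_drop(Form_pg_attribute att, List *options)
{
    elog(NOTICE, "dummy compression method dropped");
}

Datum
dummy_compression_handler(PG_FUNCTION_ARGS)
{
    CompressionMethodRoutine *cmr = makeNode(CompressionMethodRoutine);

    cmr->configure = dummy_configure;
    cmr->drop = dummy_drop;
    /* compress/decompress omitted here for brevity; a working method
     * must provide them, e.g. as thin wrappers around pglz */
    cmr->compress = NULL;
    cmr->decompress = NULL;

    PG_RETURN_POINTER(cmr);
}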
Similarly for the compression options - we need to test that the WITH
part is handled correctly (tsvector does not provide configure method).
Which reminds me I'm confused by pg_compression_opt. Consider this:
CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler;
CREATE TABLE t (a tsvector COMPRESSED ts1);
db=# select * from pg_compression_opt ;
cmoptoid | cmname | cmhandler | cmoptions
----------+--------+------------------------------+-----------
28198689 | ts1 | tsvector_compression_handler |
(1 row)
DROP TABLE t;
db=# select * from pg_compression_opt ;
cmoptoid | cmname | cmhandler | cmoptions
----------+--------+------------------------------+-----------
28198689 | ts1 | tsvector_compression_handler |
(1 row)
db=# DROP COMPRESSION METHOD ts1;
ERROR: cannot drop compression method ts1 because other objects
depend on it
DETAIL: compression options for ts1 depends on compression method
ts1
HINT: Use DROP ... CASCADE to drop the dependent objects too.
I believe the pg_compression_opt is actually linked to pg_attribute, in
which case it should include (attrelid,attnum), and should be dropped
when the table is dropped.
I suppose it was done this way to work around the lack of recompression
(i.e. the compressed value might have ended up in another table), but
that is no longer true.
A few more comments:
1) The patch makes optionListToArray (in foreigncmds.c) non-static, but
it's not used anywhere. So this seems like a change that is no longer
necessary.
2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do
CREATE TABLE t (c TEXT COMPRESSION cm1);
ALTER ... SET COMPRESSION name ...
ALTER ... SET COMPRESSION none;
Although I agree the "SET COMPRESSION none" is a bit strange.
3) heap_prepare_insert uses this chunk of code
+ else if (HeapTupleHasExternal(tup)
+ || RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
+ || HeapTupleHasCustomCompressed(tup)
+ || tup->t_len > TOAST_TUPLE_THRESHOLD)
Shouldn't that be rather
+ else if (HeapTupleHasExternal(tup)
+ || (RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
+ && HeapTupleHasCustomCompressed(tup))
+ || tup->t_len > TOAST_TUPLE_THRESHOLD)
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
  data-sorted-exceeds-ram.png (image/png, 216.6 KB)
  data-sorted-fits-into-ram.png (image/png, 215.8 KB)
  data-unsorted-exceeds-ram.png (image/png, 96.8 KB)
  data-unsorted-fits-into-ram.png (image/png, 207.7 KB)
From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-30 04:18:33
Message-ID: CAB7nPqSfuxBnsJs4a04M+GZF1AUQoAgTt989n3oJOvGrZS-xXg@mail.gmail.com
On Thu, Nov 30, 2017 at 8:30 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 11/28/2017 02:29 PM, Ildus Kurbangaliev wrote:
>> On Mon, 27 Nov 2017 18:20:12 +0100
>> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>> I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first
>>> tried without it, and it seemed to work fine). If that's the case,
>>> I bet there is a palloc that should have been palloc0, or something
>>> like that.
>>
>> Thanks, that was it. I've been able to reproduce this bug. The
>> attached patch should fix it, and I've also added recompression for
>> tuples moved to a relation with a compressed attribute.
>>
>
> I've done many tests with fulltext search on the mail archive, using
> different compression algorithms, and this time it worked fine. So I can
> confirm v7 fixes the issue.
Moved to next CF.
--
Michael
From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-30 15:20:09
Message-ID: [email protected]
On Thu, 30 Nov 2017 00:30:37 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> While the results may look different for other datasets, my
> conclusion is that it's unlikely we'll find another general-purpose
> algorithm beating pglz (enough for people to switch to it, as they'll
> need to worry about testing, deployment of extensions etc).
>
> That doesn't necessarily mean supporting custom compression algorithms
> is pointless, of course, but I think people will be much more
> interested in exploiting known features of the data (instead of
> treating the values as opaque arrays of bytes).
>
> For example, I see the patch implements a special compression method
> for tsvector values (used in the tests), exploiting knowledge of its
> internal structure. I haven't tested whether that is an improvement (either
> in compression/decompression speed or compression ratio), though.
>
> I can imagine other interesting use cases - for example values in
> JSONB columns often use the same "schema" (keys, nesting, ...), so I
> can imagine building a "dictionary" of JSON keys for the whole
> column ...
>
> Ildus, is this a use case you've been aiming for, or were you aiming
> to use the new API in a different way?
Thank you for such a good overview. I agree that pglz is pretty good as a
general compression method, and there's no point in changing it, at
least for now.
I see a few useful use cases for compression methods: special
compression methods for int[] and timestamp[] for time series, and yes,
dictionaries for jsonb, for which I have already created an extension
(https://github.com/postgrespro/jsonbd). It's working and giving
promising results.
>
> I wonder if the patch can be improved to handle this use case better.
> For example, it requires knowledge of the actual data type, instead of
> treating it as opaque varlena / byte array. I see tsvector compression
> does that by checking typeid in the handler.
>
> But that fails for example with this example
>
> db=# create domain x as tsvector;
> CREATE DOMAIN
> db=# create table t (a x compressed ts1);
> ERROR: unexpected type 28198672 for tsvector compression handler
>
> which means it's a few bricks shy of properly supporting domains. But I
> wonder if this should instead be specified in CREATE COMPRESSION
> METHOD. I mean, something like
>
> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
> TYPE tsvector;
>
> When no type is specified, it applies to all varlena values; otherwise
> only to that type. Also, why not allow setting the compression as
> the default method for a data type, e.g.
>
> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
> TYPE tsvector DEFAULT;
>
> would automatically add 'COMPRESSED ts1' to all tsvector columns in
> new CREATE TABLE commands.
The initial version of the patch contained ALTER syntax that changed the
compression method for whole types, but I decided to remove that
functionality for now because the patch is already quite complex; it
could be added later as a separate patch.
Syntax was:
ALTER TYPE <type> SET COMPRESSION <cm>;
Specifying the supported type for the compression method is a good idea.
Maybe the following syntax would be better?
CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
tsvector_compression_handler;
>
> BTW do you expect the tsvector compression to be generally useful, or
> is it meant to be used only by the tests? If generally useful,
> perhaps it should be created in pg_compression by default. If only
> for tests, maybe it should be implemented in an extension in contrib
> (thus also serving as an example of how to implement new methods).
>
> I haven't thought about the JSONB use case very much, but I suppose
> that could be done using the configure/drop methods. I mean,
> allocating the dictionary somewhere (e.g. in a table created by an
> extension?). The configure method gets the Form_pg_attribute record,
> so that should be enough I guess.
>
> But the patch is not testing those two methods at all, which seems
> like something that needs to be addressed before commit. I don't
> expect a full-fledged JSONB compression extension, but something
> simple that actually exercises those methods in a meaningful way.
I will move tsvector_compression_handler to a separate extension in
the next version. I added it more as an example, but it could also be
used to achieve better compression for tsvectors. Tests on the maillists
database ('archie' tables):
usual compression:
maillists=# select body_tsvector, subject_tsvector into t1 from
messages;
SELECT 1114213
maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
pg_size_pretty
----------------
1637 MB
(1 row)
tsvector_compression_handler:
maillists=# select pg_size_pretty(pg_total_relation_size('t2'));
pg_size_pretty
----------------
1521 MB
(1 row)
lz4:
maillists=# select pg_size_pretty(pg_total_relation_size('t3'));
pg_size_pretty
----------------
1487 MB
(1 row)
I'm not attached to tsvector_compression_handler; if there is some
example that can use all the features, then tsvector_compression_handler
could be replaced with it. My extension for jsonb dictionaries is big
enough, and I'm not ready to try to include it in the patch. I do see
the use of the 'drop' method, since there should be a way for an
extension to clean up its resources, but I don't see a simple enough
usage for it in tests. Maybe just dummy methods for 'drop' and
'configure' will be enough for testing purposes.
>
> Similarly for the compression options - we need to test that the WITH
> part is handled correctly (tsvector does not provide configure
> method).
I could add some options to tsvector_compression_handler, like options
that change pglz_compress parameters.
>
> Which reminds me I'm confused by pg_compression_opt. Consider this:
>
> CREATE COMPRESSION METHOD ts1 HANDLER
> tsvector_compression_handler; CREATE TABLE t (a tsvector COMPRESSED
> ts1);
>
> db=# select * from pg_compression_opt ;
> cmoptoid | cmname | cmhandler | cmoptions
> ----------+--------+------------------------------+-----------
> 28198689 | ts1 | tsvector_compression_handler |
> (1 row)
>
> DROP TABLE t;
>
> db=# select * from pg_compression_opt ;
> cmoptoid | cmname | cmhandler | cmoptions
> ----------+--------+------------------------------+-----------
> 28198689 | ts1 | tsvector_compression_handler |
> (1 row)
>
> db=# DROP COMPRESSION METHOD ts1;
> ERROR: cannot drop compression method ts1 because other objects
> depend on it
> DETAIL: compression options for ts1 depends on compression method
> ts1
> HINT: Use DROP ... CASCADE to drop the dependent objects too.
>
> I believe the pg_compression_opt is actually linked to pg_attribute,
> in which case it should include (attrelid,attnum), and should be
> dropped when the table is dropped.
>
> I suppose it was done this way to work around the lack of
> recompression (i.e. the compressed value might have ended up in
> another table), but that is no longer true.
Good point. Since there is recompression now, the options could be
safely removed when the table is dropped. It will complicate pg_upgrade,
but I think this is solvable.
>
> A few more comments:
>
> 1) The patch makes optionListToArray (in foreigncmds.c) non-static,
> but it's not used anywhere. So this seems like a change that is no
> longer necessary.
I use this function in CreateCompressionOptions.
>
> 2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
> COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do
>
> CREATE TABLE t (c TEXT COMPRESSION cm1);
> ALTER ... SET COMPRESSION name ...
> ALTER ... SET COMPRESSION none;
>
> Although I agree the "SET COMPRESSION none" is a bit strange.
I agree; I've already changed the syntax for the next version of the
patch. It's COMPRESSION instead of COMPRESSED, and DROP COMPRESSION
instead of SET NOT COMPRESSED. One word less in the grammar, and it
looks nicer.
>
> 3) heap_prepare_insert uses this chunk of code
>
> + else if (HeapTupleHasExternal(tup)
> + || RelationGetDescr(relation)->tdflags &
> TD_ATTR_CUSTOM_COMPRESSED
> + || HeapTupleHasCustomCompressed(tup)
> + || tup->t_len > TOAST_TUPLE_THRESHOLD)
>
> Shouldn't that be rather
>
> + else if (HeapTupleHasExternal(tup)
> + || (RelationGetDescr(relation)->tdflags &
> TD_ATTR_CUSTOM_COMPRESSED
> + && HeapTupleHasCustomCompressed(tup))
> + || tup->t_len > TOAST_TUPLE_THRESHOLD)
These conditions are used for opposite directions.
HeapTupleHasCustomCompressed(tup) is true if the tuple has compressed
datums inside, which we may need to decompress first, while the
TD_ATTR_CUSTOM_COMPRESSED flag means that the relation we're putting the
data into has columns with custom compression, so we may need to
compress datums in the current tuple.
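Spelled out with comments, the same condition reads (just an annotated
restatement, not new code):

else if (HeapTupleHasExternal(tup)
         /* the target relation has custom-compressed columns, so
          * incoming datums may need to be compressed for it */
         || RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
         /* the tuple itself carries custom-compressed datums, which
          * may need to be decompressed (or recompressed) first */
         || HeapTupleHasCustomCompressed(tup)
         || tup->t_len > TOAST_TUPLE_THRESHOLD)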
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-30 19:47:13
Message-ID: [email protected]
On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> On Thu, 30 Nov 2017 00:30:37 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> ...
>
>> I can imagine other interesting use cases - for example values in
>> JSONB columns often use the same "schema" (keys, nesting, ...), so I
>> can imagine building a "dictionary" of JSON keys for the whole
>> column ...
>>
>> Ildus, is this a use case you've been aiming for, or were you aiming
>> to use the new API in a different way?
>
> Thank you for such a good overview. I agree that pglz is pretty good
> as a general compression method, and there's no point in changing it,
> at least for now.
>
> I see a few useful use cases for compression methods: special
> compression methods for int[] and timestamp[] for time series, and
> yes, dictionaries for jsonb, for which I have already created an
> extension (https://github.com/postgrespro/jsonbd). It's working and
> giving promising results.
>
I understand the reluctance to put everything into core, particularly
for complex patches that evolve quickly. Also, not having to put
everything into core is kinda why we have extensions.
But perhaps some of the simpler cases would be good candidates for core,
making it possible to test the feature?
>>
>> I wonder if the patch can be improved to handle this use case better.
>> For example, it requires knowledge of the actual data type, instead of
>> treating it as opaque varlena / byte array. I see tsvector compression
>> does that by checking typeid in the handler.
>>
>> But that fails for example with this example
>>
>> db=# create domain x as tsvector;
>> CREATE DOMAIN
>> db=# create table t (a x compressed ts1);
>> ERROR: unexpected type 28198672 for tsvector compression handler
>>
>> which means it's a few bricks shy of properly supporting domains. But I
>> wonder if this should instead be specified in CREATE COMPRESSION
>> METHOD. I mean, something like
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector;
>>
>> When no type is specified, it applies to all varlena values; otherwise
>> only to that type. Also, why not allow setting the compression as
>> the default method for a data type, e.g.
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector DEFAULT;
>>
>> would automatically add 'COMPRESSED ts1' to all tsvector columns in
>> new CREATE TABLE commands.
>
> The initial version of the patch contained ALTER syntax that changed
> the compression method for whole types, but I decided to remove
> that functionality for now because the patch is already quite complex;
> it could be added later as a separate patch.
>
> Syntax was:
> ALTER TYPE <type> SET COMPRESSION <cm>;
>
> Specifying the supported type for the compression method is a good idea.
> Maybe the following syntax would be better?
>
> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> tsvector_compression_handler;
>
Understood. Good to know you've considered it, and I agree it doesn't
need to be there from the start (which makes the patch simpler).
>>
>> BTW do you expect the tsvector compression to be generally useful, or
>> is it meant to be used only by the tests? If generally useful,
>> perhaps it should be created in pg_compression by default. If only
>> for tests, maybe it should be implemented in an extension in contrib
>> (thus also serving as example how to implement new methods).
>>
>> I haven't thought about the JSONB use case very much, but I suppose
>> that could be done using the configure/drop methods. I mean,
>> allocating the dictionary somewhere (e.g. in a table created by an
>> extension?). The configure method gets the Form_pg_attribute record,
>> so that should be enough I guess.
>>
>> But the patch is not testing those two methods at all, which seems
>> like something that needs to be addressed before commit. I don't
>> expect a full-fledged JSONB compression extension, but something
>> simple that actually exercises those methods in a meaningful way.
>
> I will move tsvector_compression_handler to a separate extension in
> the next version. I added it more as an example, but it could also be
> used to achieve better compression for tsvectors. Tests on the maillists
> database ('archie' tables):
>
> usual compression:
>
> maillists=# select body_tsvector, subject_tsvector into t1 from
> messages; SELECT 1114213
> maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
> pg_size_pretty
> ----------------
> 1637 MB
> (1 row)
>
> tsvector_compression_handler:
> maillists=# select pg_size_pretty(pg_total_relation_size('t2'));
> pg_size_pretty
> ----------------
> 1521 MB
> (1 row)
>
> lz4:
> maillists=# select pg_size_pretty(pg_total_relation_size('t3'));
> pg_size_pretty
> ----------------
> 1487 MB
> (1 row)
>
> I'm not attached to tsvector_compression_handler; I think if there
> will be some example that can use all the features, then
> tsvector_compression_handler could be replaced with it.
>
OK. I think it's a nice use case (and nice gains on the compression
ratio), demonstrating the datatype-aware compression. The question is
why shouldn't this be built into the datatypes directly?
That would certainly be possible for tsvector, although it wouldn't be
as transparent (the datatype code would have to support it explicitly).
I'm a bit torn on this. The custom compression method patch makes the
compression mostly transparent for the datatype code (by adding an extra
"compression" header). But it's coupled to the datatype quite strongly
as it requires knowledge of the data type internals.
It's a bit less coupled for "generic" datatypes (e.g. arrays of other
types), where it may add important information (e.g. that the array
represents a chunk of timeseries data, which the array code can't
possibly know).
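As a toy illustration of the kind of knowledge meant here (a sketch, not anything from the patch): a method that knows an int64[] holds ordered timestamps can delta-encode it, turning large absolute values into small, highly compressible deltas - something a generic byte-oriented compressor can't exploit nearly as well.

/* Toy sketch: delta-encode an array of int64 timestamps (microseconds,
 * like TimestampTz). Consecutive values in a time series are close
 * together, so the deltas are small and compress much better than the
 * raw values would. */
static void
delta_encode(const int64 *ts, int n, int64 *deltas)
{
    int64       prev = 0;
    int         i;

    for (i = 0; i < n; i++)
    {
        deltas[i] = ts[i] - prev;       /* first element stored as-is */
        prev = ts[i];
    }
}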
>
> My extension for jsonb dictionaries is big enough, and I'm not ready
> to try to include it in the patch. I do see the use of the 'drop'
> method, since there should be a way for an extension to clean up its
> resources, but I don't see a simple enough usage for it in tests.
> Maybe just dummy methods for 'drop' and 'configure' will be enough
> for testing purposes.
>
OK.
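(For the archives, a dummy handler along those lines could be as small as the sketch below. The routine-table struct and its field names are guesses based on how the API has been described in this thread, not the patch's actual definitions.)

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_attribute.h"
#include "nodes/pg_list.h"

/* Hypothetical routine table; the real one lives in the patch. */
typedef struct CompressionMethodRoutine
{
    void        (*configure) (Form_pg_attribute attr, List *options);
    void        (*drop) (Oid cmoptoid);
    struct varlena *(*compress) (const struct varlena *value, Oid cmoptoid);
    struct varlena *(*decompress) (const struct varlena *value, Oid cmoptoid);
} CompressionMethodRoutine;

static void
dummy_configure(Form_pg_attribute attr, List *options)
{
    /* no-op: exists only so tests exercise the configure path */
}

static void
dummy_drop(Oid cmoptoid)
{
    /* no-op: exists only so tests exercise the drop path */
}

PG_FUNCTION_INFO_V1(dummy_compression_handler);

Datum
dummy_compression_handler(PG_FUNCTION_ARGS)
{
    CompressionMethodRoutine *cmr = palloc0(sizeof(CompressionMethodRoutine));

    cmr->configure = dummy_configure;
    cmr->drop = dummy_drop;
    cmr->compress = NULL;       /* assumption: NULL means "no compression" */
    cmr->decompress = NULL;
    PG_RETURN_POINTER(cmr);
}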
>>
>> Similarly for the compression options - we need to test that the WITH
>> part is handled correctly (tsvector does not provide a configure
>> method).
>
> I could add some options to tsvector_compression_handler, like options
> that change pglz_compress parameters.
>
+1 for doing that
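(A sketch of how such options might map onto pglz's knobs. PGLZ_Strategy and its fields are real; the option names and everything else here are invented for illustration.)

#include "postgres.h"
#include "commands/defrem.h"
#include "common/pg_lzcompress.h"
#include "nodes/pg_list.h"
#include "utils/builtins.h"

/* Hypothetical: derive a pglz strategy from per-column WITH options. */
static PGLZ_Strategy
ts1_build_strategy(List *options)
{
    PGLZ_Strategy strategy = *PGLZ_strategy_default;
    ListCell   *lc;

    foreach(lc, options)
    {
        DefElem    *def = (DefElem *) lfirst(lc);

        if (strcmp(def->defname, "min_comp_rate") == 0)
            strategy.min_comp_rate = pg_atoi(defGetString(def),
                                             sizeof(int32), 0);
        else if (strcmp(def->defname, "first_success_by") == 0)
            strategy.first_success_by = pg_atoi(defGetString(def),
                                                sizeof(int32), 0);
        else
            elog(ERROR, "unrecognized compression option \"%s\"",
                 def->defname);
    }
    return strategy;
}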
>>
>> Which reminds me I'm confused by pg_compression_opt. Consider this:
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER
>> tsvector_compression_handler; CREATE TABLE t (a tsvector COMPRESSED
>> ts1);
>>
>> db=# select * from pg_compression_opt ;
>> cmoptoid | cmname | cmhandler | cmoptions
>> ----------+--------+------------------------------+-----------
>> 28198689 | ts1 | tsvector_compression_handler |
>> (1 row)
>>
>> DROP TABLE t;
>>
>> db=# select * from pg_compression_opt ;
>> cmoptoid | cmname | cmhandler | cmoptions
>> ----------+--------+------------------------------+-----------
>> 28198689 | ts1 | tsvector_compression_handler |
>> (1 row)
>>
>> db=# DROP COMPRESSION METHOD ts1;
>> ERROR: cannot drop compression method ts1 because other objects
>> depend on it
>> DETAIL: compression options for ts1 depends on compression method
>> ts1
>> HINT: Use DROP ... CASCADE to drop the dependent objects too.
>>
>> I believe the pg_compression_opt is actually linked to pg_attribute,
>> in which case it should include (attrelid,attnum), and should be
>> dropped when the table is dropped.
>>
>> I suppose it was done this way to work around the lack of
>> recompression (i.e. the compressed value might have ended up in another
>> table), but that is no longer true.
>
> Good point. Since there is recompression now, the options could be
> safely removed when the table is dropped. It will complicate pg_upgrade,
> but I think this is solvable.
>
+1 to do that. I've never dealt with pg_upgrade, but I suppose this
shouldn't be more complicated than for custom data types, right?
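(For concreteness, the amended catalog could look roughly like this; the column names and OID are invented, not taken from the patch:)

/* Hypothetical amended pg_compression_opt: each options row belongs to
 * exactly one column, so it can be dropped together with the column. */
CATALOG(pg_compression_opt,3453)
{
    NameData    cmname;         /* compression method name */
    regproc     cmhandler;      /* handler function */
    Oid         attrelid;       /* table the compressed column belongs to */
    int16       attnum;         /* column number within attrelid */
#ifdef CATALOG_VARLEN           /* variable-length fields start here */
    text        cmoptions[1];   /* method-specific options */
#endif
} FormData_pg_compression_opt;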
>>
>> A few more comments:
>>
>> 1) The patch makes optionListToArray (in foreigncmds.c) non-static,
>> but it's not used anywhere. So this seems like change that is no
>> longer necessary.
>
> I use this function in CreateCompressionOptions.
>
Ah, my mistake. I only did 'git grep', which does not search in new
files (not added to git). It seems a bit strange to keep the function
in foreigncmds.c, though, now that we use it outside of FDWs.
>>
>> 2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
>> COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do
>>
>> CREATE TABLE t (c TEXT COMPRESSION cm1);
>> ALTER ... SET COMPRESSION name ...
>> ALTER ... SET COMPRESSION none;
>>
>> Although I agree the "SET COMPRESSION none" is a bit strange.
>
> I agree, I've already changed the syntax for the next version of the
> patch. It's COMPRESSION instead of COMPRESSED and DROP COMPRESSION
> instead of SET NOT COMPRESSED. One word less in the grammar, and it
> looks nicer.
>
I'm not sure DROP COMPRESSION is a good idea. It implies that the data
will be uncompressed, but I assume it merely switches to the built-in
compression (pglz), right? Although "SET COMPRESSION none" has the same
issue ...
BTW, when you do DROP COMPRESSION (or whatever syntax we end up using),
will that remove the dependencies on the compression method? I haven't
tried, so maybe it does.
>>
>> 3) heap_prepare_insert uses this chunk of code
>>
>> + else if (HeapTupleHasExternal(tup)
>> + || RelationGetDescr(relation)->tdflags &
>> TD_ATTR_CUSTOM_COMPRESSED
>> + || HeapTupleHasCustomCompressed(tup)
>> + || tup->t_len > TOAST_TUPLE_THRESHOLD)
>>
>> Shouldn't that be rather
>>
>> + else if (HeapTupleHasExternal(tup)
>> + || (RelationGetDescr(relation)->tdflags &
>> TD_ATTR_CUSTOM_COMPRESSED
>> + && HeapTupleHasCustomCompressed(tup))
>> + || tup->t_len > TOAST_TUPLE_THRESHOLD)
>
> These conditions are used for opposite directions.
> HeapTupleHasCustomCompressed(tup) is true if the tuple has compressed
> datums inside and we need to decompress them first, and the
> TD_ATTR_CUSTOM_COMPRESSED flag means that the relation we're putting the
> data into has columns with custom compression, so we may need to
> compress datums in the current tuple.
>
Ah, right, now it makes sense. Thanks for explaining.
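(Restating that as comments on the condition, for anyone else reading along:)

else if (HeapTupleHasExternal(tup)
         /* the target relation has custom-compressed columns, so datums
          * in this tuple may still need to be compressed on the way in */
         || RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
         /* the tuple already carries custom-compressed datums, which may
          * need to be decompressed first (e.g. when copied from a table
          * with a different compression setup) */
         || HeapTupleHasCustomCompressed(tup)
         || tup->t_len > TOAST_TUPLE_THRESHOLD)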
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-11-30 20:51:55 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Tomas Vondra wrote:
> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> > CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> > tsvector_compression_handler;
>
> Understood. Good to know you've considered it, and I agree it doesn't
> need to be there from the start (which makes the patch simpler).
Just passing by, but wouldn't this fit in the ACCESS METHOD group of
commands? So this could be simplified down to
CREATE ACCESS METHOD ts1 TYPE COMPRESSION
We have that for indexes, and there are patches flying for heap storage,
sequences, etc. I think that's simpler than trying to invent all new
commands here. Then (in a future patch) you can use ALTER TYPE to
define compression for that type, or even add a column-level option to
reference a specific compression method.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 14:10:23 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
>
>>> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
>>> tsvector_compression_handler;
>>
>> Understood. Good to know you've considered it, and I agree it doesn't
>> need to be there from the start (which makes the patch simpler).
>
> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
> commands? So this could be simplified down to
> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
> We have that for indexes, and there are patches flying for heap storage,
> sequences, etc. I think that's simpler than trying to invent all new
> commands here. Then (in a future patch) you can use ALTER TYPE to
> define compression for that type, or even add a column-level option to
> reference a specific compression method.
>
I think that would conflate two very different concepts. In my mind,
access methods define how rows are stored. Compression methods are an
orthogonal concept, e.g. you can compress a value (using a custom
compression algorithm) and store it in an index (using whatever access
method it's using). So not only do access methods operate on rows (while
compression operates on varlena values), but you can also combine those
two things. I don't see how you could do that if both are defined
as "access methods" ...
Furthermore, the "TYPE" in CREATE COMPRESSION METHOD was meant to
restrict the compression algorithm to a particular data type (so, if it
relies on tsvector, you can't apply it to text columns). Which is very
different from "TYPE COMPRESSION" in CREATE ACCESS METHOD.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 14:23:07 |
Message-ID: | CA+TgmoZFEhqxXBOONMQeCZ09AJGqE=H2UMySF4zHZsatA0_E2Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> OK. I think it's a nice use case (and nice gains on the compression
> ratio), demonstrating the datatype-aware compression. The question is
> why shouldn't this be built into the datatypes directly?
Tomas, thanks for running benchmarks of this. I was surprised to see
how little improvement there was from other modern compression
methods, although lz4 did appear to be a modest win on both size and
speed. But I share your intuition that a lot of the interesting work
is in datatype-specific compression algorithms. I have noticed in a
number of papers that I've read that teaching other parts of the
system to operate directly on the compressed data, especially for
column stores, is a critical performance optimization; of course, that
only makes sense if the compression is datatype-specific. I don't
know exactly what that means for the design of this patch, though.
As a general point, no matter which way you go, you have to somehow
deal with on-disk compatibility. If you want to build in compression
to the datatype itself, you need to find at least one bit someplace to
mark the fact that you applied built-in compression. If you want to
build it in as a separate facility, you need to denote the compression
used someplace else. I haven't looked at how this patch does it, but
the proposal in the past has been to add a value to vartag_external.
One nice thing about the latter method is that it can be used for any
data type generically, regardless of how much bit-space is available
in the data type representation itself. It's realistically hard to
think of a data-type that has no bit space available anywhere but is
still subject to data-type specific compression; bytea definitionally
has no bit space but also can't benefit from special-purpose
compression, whereas even something like text could be handled by
starting the varlena with a NUL byte to indicate compressed data
following. However, you'd have to come up with a different trick for
each data type. Piggybacking on the TOAST machinery avoids that. It
also implies that we only try to compress values that are "big", which
is probably desirable if we're talking about a kind of compression
that makes comprehending the value slower. Not all types of
compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
and for those that don't it probably makes more sense to just build it
into the data type.
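(For context, the vartag_external route mentioned above would mean adding a member to the existing enum in postgres.h, roughly as below; the new member and its value are hypothetical.)

typedef enum vartag_external
{
    VARTAG_INDIRECT = 1,
    VARTAG_EXPANDED_RO = 2,
    VARTAG_EXPANDED_RW = 3,
    VARTAG_ONDISK = 18,
    VARTAG_COMPRESSED = 19      /* hypothetical: custom-compressed datum */
} vartag_external;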
All of that is a somewhat separate question from whether we should
have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
the ACCESS METHOD stuff instead). Even if we agree that piggybacking
on TOAST is a good way to implement pluggable compression methods, it
doesn't follow that the compression method is something that should be
attached to the datatype from the outside; it could be built into it
in a deep way. For example, "packed" varlenas (1-byte header) are a
form of compression, and the default functions for detoasting always
produce unpacked values, but the operators for the text data type
know how to operate on the packed representation. That's sort of a
trivial example, but it might well be that there are other cases where
we can do something similar. Maybe jsonb, for example, can compress
data in such a way that some of the jsonb functions can operate
directly on the compressed representation -- perhaps the number of
keys is easily visible, for example, or maybe more. In this view of
the world, each data type should get to define its own compression
method (or methods) but they are hard-wired into the datatype and you
can't add more later, or if you do, you lose the advantages of the
hard-wired stuff.
BTW, another related concept that comes up a lot in discussions of
this area is that we could do a lot better compression of columns if
we had some place to store a per-column dictionary. I don't really
know how to make that work. We could have a catalog someplace that
stores an opaque blob for each column configured to use a compression
method, and let the compression method store whatever it likes in
there. That's probably fine if you are compressing the whole table at
once and the blob is static thereafter. But if you want to update
that blob as you see new column values there seem to be almost
insurmountable problems.
To be clear, I'm not trying to load this patch down with a requirement
to solve every problem in the universe. On the other hand, I think it
would be easy to beat a patch like this into shape in a fairly
mechanical way and then commit-and-forget. That might be leaving a
lot of money on the table; I'm glad you are thinking about the bigger
picture and hope that my thoughts here somehow contribute.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 15:18:49 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 12/01/2017 03:23 PM, Robert Haas wrote:
> On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> OK. I think it's a nice use case (and nice gains on the compression
>> ratio), demonstrating the datatype-aware compression. The question is
>> why shouldn't this be built into the datatypes directly?
>
> Tomas, thanks for running benchmarks of this. I was surprised to see
> how little improvement there was from other modern compression
> methods, although lz4 did appear to be a modest win on both size and
> speed. But I share your intuition that a lot of the interesting work
> is in datatype-specific compression algorithms. I have noticed in a
> number of papers that I've read that teaching other parts of the
> system to operate directly on the compressed data, especially for
> column stores, is a critical performance optimization; of course, that
> only makes sense if the compression is datatype-specific. I don't
> know exactly what that means for the design of this patch, though.
>
It has very little impact on this patch, as it has nothing to do with
columnar storage. That is, each value is compressed independently.
Column stores exploit the fact that they get a vector of values,
compressed in some data-aware way. E.g. some form of RLE or dictionary
compression, which allows them to evaluate expressions on the compressed
vector. But that's irrelevant here, we only get row-by-row execution.
Note: The idea to build a dictionary for the whole jsonb column (which
this patch should allow) does not make it "columnar compression" in the
"column store" way. The executor will still get the decompressed value.
> As a general point, no matter which way you go, you have to somehow
> deal with on-disk compatibility. If you want to build in compression
> to the datatype itself, you need to find at least one bit someplace to
> mark the fact that you applied built-in compression. If you want to
> build it in as a separate facility, you need to denote the compression
> used someplace else. I haven't looked at how this patch does it, but
> the proposal in the past has been to add a value to vartag_external.
AFAICS the patch does that by setting a bit in the varlena header, and
then adding the OID of the compression method after the varlena header.
So you get (varlena header + OID + data).
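(As a picture - the field names are invented, but this matches the description above:)

/* Hypothetical layout of a custom-compressed datum: a varlena length
 * word, an info word whose spare bit marks custom compression, the
 * compression-options OID, then the compressed bytes. */
typedef struct CustomCompressedDatum
{
    int32       vl_len_;        /* standard varlena header */
    uint32      va_info;        /* raw size + custom-compression flag bit */
    Oid         cmoptoid;       /* which pg_compression_opt row to use */
    char        data[FLEXIBLE_ARRAY_MEMBER];    /* compressed payload */
} CustomCompressedDatum;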
This has good and bad consequences.
Good: It's transparent for the datatype, so it does not have to worry
about the custom compression at all (and it may change arbitrarily).
Bad: It's transparent for the datatype, so it can't operate directly on
the compressed representation.
I don't think this is an argument against the patch, though. If the
datatype can support intelligent compression (and execution without
decompression), it has to be done in the datatype anyway.
> One nice thing about the latter method is that it can be used for any
> data type generically, regardless of how much bit-space is available
> in the data type representation itself. It's realistically hard to
> think of a data-type that has no bit space available anywhere but is
> still subject to data-type specific compression; bytea definitionally
> has no bit space but also can't benefit from special-purpose
> compression, whereas even something like text could be handled by
> starting the varlena with a NUL byte to indicate compressed data
> following. However, you'd have to come up with a different trick for
> each data type. Piggybacking on the TOAST machinery avoids that. It
> also implies that we only try to compress values that are "big", which
> is probably desirable if we're talking about a kind of compression
> that makes comprehending the value slower. Not all types of
> compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
> and for those that don't it probably makes more sense to just build it
> into the data type.
>
> All of that is a somewhat separate question from whether we should
> have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
> the ACCESS METHOD stuff instead). Even if we agree that piggybacking
> on TOAST is a good way to implement pluggable compression methods, it
> doesn't follow that the compression method is something that should be
> attached to the datatype from the outside; it could be built into it
> in a deep way. For example, "packed" varlenas (1-byte header) are a
> form of compression, and the default functions for detoasting always
> produce unpacked values, but the operators for the text data type
> know how to operate on the packed representation. That's sort of a
> trivial example, but it might well be that there are other cases where
> we can do something similar. Maybe jsonb, for example, can compress
> data in such a way that some of the jsonb functions can operate
> directly on the compressed representation -- perhaps the number of
> keys is easily visible, for example, or maybe more. In this view of
> the world, each data type should get to define its own compression
> method (or methods) but they are hard-wired into the datatype and you
> can't add more later, or if you do, you lose the advantages of the
> hard-wired stuff.
>
I agree with these thoughts in general, but I'm not quite sure what
your conclusion is regarding the patch.
The patch allows us to define custom compression methods that are
entirely transparent for the datatype machinery, i.e. they allow
compression even for data types that did not consider compression at
all. That seems valuable to me.
Of course, if the same compression logic can be built into the datatype
itself, it may allow additional benefits (like execution on compressed
data directly).
I don't see these two approaches as conflicting.
>
> BTW, another related concept that comes up a lot in discussions of
> this area is that we could do a lot better compression of columns if
> we had some place to store a per-column dictionary. I don't really
> know how to make that work. We could have a catalog someplace that
> stores an opaque blob for each column configured to use a compression
> method, and let the compression method store whatever it likes in
> there. That's probably fine if you are compressing the whole table at
> once and the blob is static thereafter. But if you want to update
> that blob as you see new column values there seem to be almost
> insurmountable problems.
>
Well, that's kinda the idea behind the configure/drop methods in the
compression handler, and Ildus already implemented such dictionary
compression for the jsonb data type, see:
https://github.com/postgrespro/jsonbd
Essentially that stores the dictionary in a table, managed by a bunch of
background workers.
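(The gist of the scheme, reduced to a toy - this is not jsonbd's actual code:)

#include <string.h>

/* Toy key dictionary: map each distinct JSON key to a small integer id,
 * so compressed values can store ids instead of repeated key strings.
 * In jsonbd the mapping lives in a real table and lookups go through
 * background workers; here it is just an in-memory array. */
typedef struct
{
    const char *key;
    int         id;
} DictEntry;

static int
dict_intern(DictEntry *dict, int *ndict, const char *key)
{
    int         i;

    for (i = 0; i < *ndict; i++)
    {
        if (strcmp(dict[i].key, key) == 0)
            return dict[i].id;          /* known key: reuse its id */
    }
    dict[*ndict].key = key;             /* new key: assign the next id */
    dict[*ndict].id = *ndict;
    return (*ndict)++;
}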
>
> To be clear, I'm not trying to load this patch down with a requirement
> to solve every problem in the universe. On the other hand, I think it
> would be easy to beat a patch like this into shape in a fairly
> mechanical way and then commit-and-forget. That might be leaving a
> lot of money on the table; I'm glad you are thinking about the bigger
> picture and hope that my thoughts here somehow contribute.
>
Thanks ;-)
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 19:20:43 |
Message-ID: | CA+TgmoYGvNV7GqKLLCYcdFEhiANHA_mXxtf_EB=Pd--y14sMQg@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> It has very little impact on this patch, as it has nothing to do with
> columnar storage. That is, each value is compressed independently.
I understand that this patch is not about columnar storage, but I
think the idea that we may want to operate on the compressed data
directly is not only applicable to that case.
> I agree with these thoughts in general, but I'm not quite sure what
> your conclusion is regarding the patch.
I have not reached one. Sometimes I like to discuss problems before
deciding what I think. :-)
It does seem to me that the patch may be aiming at a relatively narrow
target in a fairly large problem space, but I don't know whether to
label that as short-sightedness or prudent incrementalism.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 19:38:42 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Tomas Vondra wrote:
> On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
> > Just passing by, but wouldn't this fit in the ACCESS METHOD group of
> > commands? So this could be simplified down to
> > CREATE ACCESS METHOD ts1 TYPE COMPRESSION
> > We have that for indexes, and there are patches flying for heap storage,
> > sequences, etc.
>
> I think that would conflate two very different concepts. In my mind,
> access methods define how rows are stored.
In mine, they define how things are accessed (i.e. more general than
what you're thinking). We *currently* use them to store rows [in
indexes], but there is no reason why we couldn't expand that.
So we group access methods in "types"; the current type we have is for
indexes, and methods in that type define how indexes are accessed. This
new type would indicate how values would be compressed. I disagree that
there is no parallel there.
I'm trying to avoid pointless proliferation of narrowly defined DDL
commands.
> Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
> restrict the compression algorithm to a particular data type (so, if it
> relies on tsvector, you can't apply it to text columns).
Yes, of course. I'm saying that the "datatype" property of a
compression access method would be declared somewhere else, not in the
TYPE clause of the CREATE ACCESS METHOD command. Perhaps it makes sense
to declare that a certain compression access method is good only for a
certain data type, and then you can put that in the options clause,
"CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
But many compression access methods would be general in nature and so
could be used for many datatypes (say, snappy).
To me it makes sense to say "let's create this method which is for data
compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
either "let's use this new compression method for the type tsvector"
(ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
COMPRESSION hyperz).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 19:48:59 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Ildus Kurbangaliev wrote:
> If the table is big, decompression could take an eternity. That's why I
> decided to only disable it; the data can still be decompressed using
> the stored compression options.
>
> My idea was to keep compression options forever, since there will not
> be many of them in one database. Still, that requires that the
> extension is not removed.
>
> I will try to find a way to recompress data first in case it moves
> to another table.
I think what you should do is add a dependency between a column that
compresses using a method, and that method. So the method cannot be
dropped and leave compressed data behind. Since the method is part of
the extension, the extension cannot be dropped either. If you ALTER
the column so that it uses another compression method, then the table is
rewritten and the dependency is removed; once you do that for all the
columns that use the compression method, the compression method can be
dropped.
Maybe our dependency code needs to be extended in order to support this.
I think the current logic would drop the column if you were to do "DROP
COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
I'd rather have DROP COMPRESSION always fail until no columns
use it. Let's hear others' opinions on this bit, though.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 20:47:43 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 12/01/2017 08:48 PM, Alvaro Herrera wrote:
> Ildus Kurbangaliev wrote:
>
>> If the table is big, decompression could take an eternity. That's why I
>> decided to only disable it; the data can still be decompressed using
>> the stored compression options.
>>
>> My idea was to keep compression options forever, since there will not
>> be many of them in one database. Still, that requires that the
>> extension is not removed.
>>
>> I will try to find a way to recompress data first in case it moves
>> to another table.
>
> I think what you should do is add a dependency between a column that
> compresses using a method, and that method. So the method cannot be
> dropped and leave compressed data behind. Since the method is part of
> the extension, the extension cannot be dropped either. If you ALTER
> the column so that it uses another compression method, then the table is
> rewritten and the dependency is removed; once you do that for all the
> columns that use the compression method, the compression method can be
> dropped.
>
+1 to do the rewrite, just like for other similar ALTER TABLE commands
>
> Maybe our dependency code needs to be extended in order to support this.
> I think the current logic would drop the column if you were to do "DROP
> COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
> I'd rather have DROP COMPRESSION always fail until no columns
> use it. Let's hear others' opinions on this bit, though.
>
Why should this behave differently compared to data types? Seems quite
against POLA, if you ask me ...
If you want to remove the compression, you can do the SET NOT COMPRESSED
(or whatever syntax we end up using), and then DROP COMPRESSION METHOD.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 21:06:00 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 12/01/2017 08:20 PM, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> It has very little impact on this patch, as it has nothing to do with
>> columnar storage. That is, each value is compressed independently.
>
> I understand that this patch is not about columnar storage, but I
> think the idea that we may want to operate on the compressed data
> directly is not only applicable to that case.
>
Yeah. To clarify, my point was that column stores benefit from
compressing many values at once, and then operating on this compressed
vector. That is not what this patch is doing (or can do), of course.
But I certainly do agree that if the compression can be integrated into
the data type, allowing processing on the compressed representation,
then that will beat whatever this patch is doing ...
>>
>> I agree with these thoughts in general, but I'm not quite sure
>> what your conclusion is regarding the patch.
>
> I have not reached one. Sometimes I like to discuss problems before
> deciding what I think. :-)
>
That's lame! Let's make decisions without discussion ;-)
>
> It does seem to me that the patch may be aiming at a relatively narrow
> target in a fairly large problem space, but I don't know whether to
> label that as short-sightedness or prudent incrementalism.
>
I don't know either. I don't think people will start switching their
text columns to lz4 just because they can, or because they get 4% space
reduction compared to pglz.
But the ability to build per-column dictionaries seems quite powerful, I
guess. And I don't think that can be easily built directly into JSONB,
because we don't have a way to provide information about the column
(i.e. how would you fetch the correct dictionary?).
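(That per-datum options OID is exactly the hook this patch provides, though - roughly like this, with every name below invented:)

/* Hypothetical: the options OID stored with each compressed datum tells
 * the decompress routine which per-column dictionary to fetch. That is
 * the "column information" a bare jsonb value doesn't carry. */
static struct varlena *
jsonb_dict_decompress(const struct varlena *value, Oid cmoptoid)
{
    Dictionary *dict = lookup_dictionary(cmoptoid);     /* invented */

    return expand_with_dictionary(value, dict);         /* invented */
}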
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 21:06:21 |
Message-ID: | CA+Tgmob38wGgpb7ttOOMyXQXWdeq+axHBPN=wk5-qq0h1yUzeQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Dec 1, 2017 at 2:38 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> In mine, they define how things are accessed (i.e. more general than
> what you're thinking). We *currently* use them to store rows [in
> indexes], but there is no reason why we couldn't expand that.
>
> So we group access methods in "types"; the current type we have is for
> indexes, and methods in that type define how indexes are accessed. This
> new type would indicate how values would be compressed. I disagree that
> there is no parallel there.
+1.
> I'm trying to avoid pointless proliferation of narrowly defined DDL
> commands.
I also think that's an important goal.
> Yes, of course. I'm saying that the "datatype" property of a
> compression access method would be declared somewhere else, not in the
> TYPE clause of the CREATE ACCESS METHOD command. Perhaps it makes sense
> to declare that a certain compression access method is good only for a
> certain data type, and then you can put that in the options clause,
> "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
> But many compression access methods would be general in nature and so
> could be used for many datatypes (say, snappy).
>
> To me it makes sense to say "let's create this method which is for data
> compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
> either "let's use this new compression method for the type tsvector"
> (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
> compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
> COMPRESSION hyperz).
+1 to this, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 21:14:58 |
Message-ID: | CA+TgmoZY+TB-wa6QgwsaL-MHCJAhR9n0ifEKTDq8PTxBuqEW-w@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Dec 1, 2017 at 4:06 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> I agree with these thoughts in general, but I'm not quite sure
>>> what your conclusion is regarding the patch.
>>
>> I have not reached one. Sometimes I like to discuss problems before
>> deciding what I think. :-)
>
> That's lame! Let's make decisions without discussion ;-)
Oh, right. What was I thinking?
>> It does seem to me that the patch may be aiming at a relatively narrow
>> target in a fairly large problem space, but I don't know whether to
>> label that as short-sightedness or prudent incrementalism.
>
> I don't know either. I don't think people will start switching their
> text columns to lz4 just because they can, or because they get 4% space
> reduction compared to pglz.
Honestly, if we can give everybody a 4% space reduction by switching
to lz4, I think that's totally worth doing -- but let's not make
people choose it, let's make it the default going forward, and keep
pglz support around so we don't break pg_upgrade compatibility (and so
people can continue to choose it if for some reason it works better in
their use case). That kind of improvement is nothing special in a
specific workload, but TOAST is a pretty general-purpose mechanism. I
have become, through a few bitter experiences, a strong believer in
the value of trying to reduce our on-disk footprint, and knocking 4%
off the size of every TOAST table in the world does not sound
worthless to me -- even though context-aware compression can doubtless
do a lot better.
> But the ability to build per-column dictionaries seems quite powerful, I
> guess. And I don't think that can be easily built directly into JSONB,
> because we don't have a way to provide information about the column
> (i.e. how would you fetch the correct dictionary?).
That's definitely a problem, but I think we should mull it over a bit
more before giving up. I have a few thoughts, but the part of my life
that doesn't happen on the PostgreSQL mailing list precludes
expounding on them right this minute.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 21:19:26 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 12/01/2017 08:38 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
>
>>> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
>>> commands? So this could be simplified down to
>>> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
>>> We have that for indexes, and there are patches flying for heap storage,
>>> sequences, etc.
>>
>> I think that would conflate two very different concepts. In my mind,
>> access methods define how rows are stored.
>
> In mine, they define how things are accessed (i.e. more general than
> what you're thinking). We *currently* use them to store rows [in
> indexes], but there is no reason why we couldn't expand that.
>
Not sure I follow. My argument was not as much about whether the rows
are stored as rows or in some other (columnar) format, but that access
methods deal with "tuples" (i.e. rows in the "logical" sense). I assume
that even if we end up implementing other access method types, they will
still be tuple-based.
OTOH compression methods (at least as introduced by this patch) operate
on individual values, and have very little to do with access to the
value (in a sense it's a transparent thing).
>
> So we group access methods in "types"; the current type we have is for
> indexes, and methods in that type define how indexes are accessed. This
> new type would indicate how values would be compressed. I disagree that
> there is no parallel there.
>
> I'm trying to avoid pointless proliferation of narrowly defined DDL
> commands.
>
Of course, the opposite case is using the same DDL for very different
concepts (although I understand you don't see it that way).
But in fairness, I don't really care if we call this COMPRESSION METHOD
or ACCESS METHOD or DARTH VADER ...
>> Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
>> restrict the compression algorithm to a particular data type (so, if it
>> relies on tsvector, you can't apply it to text columns).
>
> Yes, of course. I'm saying that the "datatype" property of a
> compression access method would be declared somewhere else, not in the
> TYPE clause of the CREATE ACCESS METHOD command. Perhaps it makes sense
> to declare that a certain compression access method is good only for a
> certain data type, and then you can put that in the options clause,
> "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
> But many compression access methods would be general in nature and so
> could be used for many datatypes (say, snappy).
>
> To me it makes sense to say "let's create this method which is for data
> compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
> either "let's use this new compression method for the type tsvector"
> (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
> compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
> COMPRESSION hyperz).
>
The WITH syntax does not seem particularly pretty to me, TBH. I'd be
much happier with "TYPE tsvector" and leaving WITH for the options
specific to each compression method.
FWIW I think syntax is the least critical part of this patch. It's ~1%
of the patch, and the gram.y additions are rather trivial.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 21:52:17 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
> Honestly, if we can give everybody a 4% space reduction by switching
> to lz4, I think that's totally worth doing -- but let's not make
> people choose it, let's make it the default going forward, and keep
> pglz support around so we don't break pg_upgrade compatibility (and so
> people can continue to choose it if for some reason it works better in
> their use case). That kind of improvement is nothing special in a
> specific workload, but TOAST is a pretty general-purpose mechanism. I
> have become, through a few bitter experiences, a strong believer in
> the value of trying to reduce our on-disk footprint, and knocking 4%
> off the size of every TOAST table in the world does not sound
> worthless to me -- even though context-aware compression can doubtless
> do a lot better.
+1. It's also a lot faster, and I've seen way too many workloads with
50%+ of the time spent in pglz.
Greetings,
Andres Freund
From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 21:53:40 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Tomas Vondra wrote:
> On 12/01/2017 08:48 PM, Alvaro Herrera wrote:
> > Maybe our dependency code needs to be extended in order to support this.
> > I think the current logic would drop the column if you were to do "DROP
> > COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
> > I'd rather have DROP COMPRESSION always fail until no columns
> > use it. Let's hear others' opinions on this bit, though.
>
> Why should this behave differently compared to data types? Seems quite
> against POLA, if you ask me ...
OK, DROP TYPE sounds like a good enough precedent, so +1 on that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-01 21:55:52 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Fri, 1 Dec 2017 16:38:42 -0300
Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> To me it makes sense to say "let's create this method which is for
> data compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION)
> followed by either "let's use this new compression method for the
> type tsvector" (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's
> use this new compression method for the column tc" (ALTER TABLE ALTER
> COLUMN tc SET COMPRESSION hyperz).
>
Hi, I think if CREATE ACCESS METHOD can be used for compression, then it
could be nicer than CREATE COMPRESSION METHOD. I just don't
know whether compression can go in as an access method or not. Anyway,
it's easy to change the syntax, and I don't mind doing it if it is
necessary for the patch to be committed.
--
----
Regards,
Ildus Kurbangaliev
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-02 15:04:52 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 12/01/2017 10:52 PM, Andres Freund wrote:
> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>> Honestly, if we can give everybody a 4% space reduction by
>> switching to lz4, I think that's totally worth doing -- but let's
>> not make people choose it, let's make it the default going forward,
>> and keep pglz support around so we don't break pg_upgrade
>> compatibility (and so people can continue to choose it if for some
>> reason it works better in their use case). That kind of improvement
>> is nothing special in a specific workload, but TOAST is a pretty
>> general-purpose mechanism. I have become, through a few bitter
>> experiences, a strong believer in the value of trying to reduce our
>> on-disk footprint, and knocking 4% off the size of every TOAST
>> table in the world does not sound worthless to me -- even though
>> context-aware compression can doubtless do a lot better.
>
> +1. It's also a lot faster, and I've seen way too many workloads
> with 50%+ of the time spent in pglz.
>
TBH the 4% figure is something I mostly made up (I'm fake news!). On the
mailing list archive (which I believe is pretty compressible) I observed
something like 2.5% size reduction with lz4 compared to pglz, at least
with the compression levels I've used ...
Other algorithms (e.g. zstd) got significantly better compression (25%)
compared to pglz, but in exchange for longer compression times. I'm sure
we could lower the compression level to make it faster, but that will of
course hurt the compression ratio.
I don't think switching to a different compression algorithm is a way
forward - it was proposed and explored repeatedly in the past, and every
time it failed for a number of reasons, most of which are still valid.
Firstly, it's going to be quite hard (or perhaps impossible) to find an
algorithm that is "universally better" than pglz. Some algorithms do
work better for text documents, some for binary blobs, etc. I don't
think there's a win-win option.
Sure, there are workloads where pglz performs poorly (I've seen such
cases too), but IMHO that's more an argument for the custom compression
method approach. pglz gives you good default compression in most cases,
and you can change it for columns where it matters, and where a
different space/time trade-off makes sense.
Secondly, all the previous attempts ran into some legal issues, i.e.
licensing and/or patents. Maybe the situation changed since then (no
idea, haven't looked into that), but in the past the "pluggable"
approach was proposed as a way to address this.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-02 20:24:17 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote:
> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>>> Honestly, if we can give everybody a 4% space reduction by
>>> switching to lz4, I think that's totally worth doing -- but let's
>>> not make people choose it, let's make it the default going forward,
>>> and keep pglz support around so we don't break pg_upgrade
>>> compatibility (and so people can continue to choose it if for some
>>> reason it works better in their use case). That kind of improvement
>>> is nothing special in a specific workload, but TOAST is a pretty
>>> general-purpose mechanism. I have become, through a few bitter
>>> experiences, a strong believer in the value of trying to reduce our
>>> on-disk footprint, and knocking 4% off the size of every TOAST
>>> table in the world does not sound worthless to me -- even though
>>> context-aware compression can doubtless do a lot better.
>>
>> +1. It's also a lot faster, and I've seen way too many workloads
>> with 50%+ of the time spent in pglz.
>>
>
> TBH the 4% figure is something I mostly made up (I'm fake news!). On the
> mailing list archive (which I believe is pretty compressible) I observed
> something like 2.5% size reduction with lz4 compared to pglz, at least
> with the compression levels I've used ...
>
> Other algorithms (e.g. zstd) got significantly better compression (25%)
> compared to pglz, but in exchange for longer compression times. I'm sure
> we could lower the compression level to make it faster, but that will
> of course hurt the compression ratio.
>
> I don't think switching to a different compression algorithm is a way
> forward - it was proposed and explored repeatedly in the past, and every
> time it failed for a number of reasons, most of which are still valid.
>
>
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.
>
> Sure, there are workloads where pglz performs poorly (I've seen such
> cases too), but IMHO that's more an argument for the custom compression
> method approach. pglz gives you good default compression in most cases,
> and you can change it for columns where it matters, and where a
> different space/time trade-off makes sense.
>
>
> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.
>
>
Maybe it will be interesting for you to see the following results of applying page-level compression (CFS in PgPro-EE) to pgbench data:
Configuration             Size (Gb)   Time (sec)
------------------------  ----------  ----------
vanilla postgres          15.31       92
zlib (default level)      2.37        284
zlib (best speed)         2.43        191
postgres internal lz      3.89        214
lz4                       4.12        95
snappy (google)           5.18        99
lzfse (apple)             2.80        1099
zstd (facebook)           1.69        125
All algorithms (except zlib) were used with best-speed option: using better compression level usually has not so large impact on compression ratio (<30%), but can significantly increase time (several times).
Certainly pgbench isnot the best candidate for testing compression algorithms: it generates a lot of artificial and redundant data.
But we measured it also on real customers data and still zstd seems to be the best compression methods: provides good compression with smallest CPU overhead.
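To make the best-speed vs. best-ratio trade-off concrete, here is a minimal standalone sketch (not CFS code and not from the patch; it only assumes libzstd is installed) that compresses one redundant buffer at a fast and a slow level:

```c
/*
 * Minimal standalone sketch (not part of CFS or any patch): compress the
 * same redundant buffer at a fast and a slow zstd level to see the
 * ratio-vs-time trade-off.  Build: cc zstd_levels.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zstd.h>

int main(void)
{
    size_t  srcsize = 16 * 1024 * 1024;
    char   *src = malloc(srcsize);
    size_t  bound = ZSTD_compressBound(srcsize);
    char   *dst = malloc(bound);
    int     levels[] = {1, 19};     /* "best speed" vs. high compression */

    /* pgbench-like redundancy: a short repeating pattern */
    for (size_t i = 0; i < srcsize; i++)
        src[i] = "0123456789abcdef"[i % 16];

    for (int l = 0; l < 2; l++)
    {
        clock_t start = clock();
        size_t  csize = ZSTD_compress(dst, bound, src, srcsize, levels[l]);

        if (ZSTD_isError(csize))
            return 1;
        printf("level %2d: ratio %.1fx, %.3f s\n", levels[l],
               (double) srcsize / csize,
               (double) (clock() - start) / CLOCKS_PER_SEC);
    }
    free(src);
    free(dst);
    return 0;
}
```

On highly redundant input like this, the slow level wins on ratio while the fast level wins heavily on time, which mirrors the table above.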
From: | Andres Freund <andres(at)anarazel(dot)de> |
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-02 20:38:39 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi,
On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.
lz4 is pretty much there.
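For illustration, a hypothetical round trip with the liblz4 C API (not code from the patch) shows how small the integration surface is:

```c
/*
 * Hypothetical liblz4 round trip (not code from any patch), showing the
 * one-call compress/decompress API.  Build: cc lz4_demo.c -llz4
 */
#include <stdio.h>
#include <string.h>
#include <lz4.h>

int main(void)
{
    const char src[] = "the quick brown fox jumps over the lazy dog, "
                       "the quick brown fox jumps over the lazy dog";
    char    comp[LZ4_COMPRESSBOUND(sizeof(src))];
    char    back[sizeof(src)];
    int     csize = LZ4_compress_default(src, comp, sizeof(src), sizeof(comp));
    int     dsize = LZ4_decompress_safe(comp, back, csize, sizeof(back));

    if (csize <= 0 || dsize != (int) sizeof(src) || memcmp(src, back, dsize))
        return 1;
    printf("%zu bytes -> %d bytes and back\n", sizeof(src), csize);
    return 0;
}
```

The bound macro plus two calls is essentially the entire surface an in-core integration would have to wrap.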
> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.
Those were pretty bogus. I think we're not doing our users a favor if
they have to download some external projects, then fiddle with things,
just to avoid a compression algorithm that's been known bad for at
least 5+ years. If we have a decent algorithm in-core *and* then allow
extensibility, that's one thing, but keeping the bad one and telling forks
"please take our users with this code we give you" is ...
Greetings,
Andres Freund
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
To: | konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-02 21:30:58 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 12/02/2017 09:24 PM, konstantin knizhnik wrote:
>
> On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote:
>
>> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> ...
>>
>> Other algorithms (e.g. zstd) got significantly better compression (25%)
>> compared to pglz, but in exchange for longer compression. I'm sure we
>> could lower compression level to make it faster, but that will of course
>> hurt the compression ratio.
>>
>> I don't think switching to a different compression algorithm is a way
>> forward - it was proposed and explored repeatedly in the past, and every
>> time it failed for a number of reasons, most of which are still valid.
>>
>>
>> Firstly, it's going to be quite hard (or perhaps impossible) to
>> find an algorithm that is "universally better" than pglz. Some
>> algorithms do work better for text documents, some for binary
>> blobs, etc. I don't think there's a win-win option.
>>
>> Sure, there are workloads where pglz performs poorly (I've seen
>> such cases too), but IMHO that's more an argument for the custom
>> compression method approach. pglz gives you good default
>> compression in most cases, and you can change it for columns where
>> it matters, and where a different space/time trade-off makes
>> sense.
>>
>>
>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>>
>>
>
> Maybe it will be interesting for you to see the following results
> of applying page-level compression (CFS in PgPro-EE) to pgbench
> data:
>
I don't follow. If I understand what CFS does correctly (and I'm mostly
guessing here, because I haven't seen the code published anywhere, and I
assume it's proprietary), it essentially compresses whole 8kB blocks.
I don't know whether it reorganizes the data into columnar format first
in some way (to make it more "columnar", which is more compressible),
which would make it somewhat similar to page-level compression in Oracle.
But it's clearly a very different approach from what the patch aims to
improve (compressing individual varlena values).
>
> All algorithms (except zlib) were used with the best-speed option:
> using a better compression level usually has a fairly small impact on
> compression ratio (<30%), but can increase compression time
> significantly (several times). Certainly pgbench is not the best
> candidate for testing compression algorithms: it generates a lot of
> artificial and redundant data. But we measured it also on real
> customer data, and zstd still seems to be the best compression
> method: it provides good compression with the smallest CPU overhead.
>
I think this really depends on the dataset, and drawing conclusions
based on a single test is somewhat crazy. Especially when it's synthetic
pgbench data with lots of inherent redundancy - sequential IDs, ...
My takeaway from the results is rather that page-level compression may
be very beneficial in some cases, although I wonder how much of that can
be gained by simply using compressed filesystem (thus making it
transparent to PostgreSQL).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-02 21:53:10 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 12/02/2017 09:38 PM, Andres Freund wrote:
> Hi,
>
> On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
>> Firstly, it's going to be quite hard (or perhaps impossible) to find an
>> algorithm that is "universally better" than pglz. Some algorithms do
>> work better for text documents, some for binary blobs, etc. I don't
>> think there's a win-win option.
>
> lz4 is pretty much there.
>
That's a matter of opinion, I guess. It's a solid compression algorithm,
that's for sure ...
>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>
> Those were pretty bogus.
IANAL, so I don't dare to judge the bogusness of such claims. I assume
that if we made it optional (e.g. a configure/initdb option), it'd be much
less of an issue. Of course, that has disadvantages too (when you
compile/init with one algorithm, and then find something else would work
better for your data, you have to start from scratch).
>
> I think we're not doing our users a favor if they have to download
> some external projects, then fiddle with things, just to avoid
> a compression algorithm that's been known bad for at least 5+ years.
> If we have a decent algorithm in-core *and* then allow extensibility,
> that's one thing, but keeping the bad one and telling forks "please take
> our users with this code we give you" is ...
>
I don't understand what exactly is your issue with external projects,
TBH. I think extensibility is one of the great strengths of Postgres.
It's not all rainbows and unicorns, of course, and it has costs too.
FWIW I don't think pglz is a "known bad" algorithm. Perhaps there are
cases where other algorithms (e.g. lz4) are running circles around it,
particularly when it comes to decompression speed, but I wouldn't say
it's "known bad".
Not sure which forks you're talking about ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Oleg Bartunov <obartunov(at)gmail(dot)com> |
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-03 02:52:13 |
Message-ID: | CAF4Au4z-7_Ya4qO=PBK37dyY-Up+CqT0v_ucqXUV8Zanu4eNLw@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Dec 2, 2017 at 6:04 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>>> Honestly, if we can give everybody a 4% space reduction by
>>> switching to lz4, I think that's totally worth doing -- but let's
>>> not make people choose it, let's make it the default going forward,
>>> and keep pglz support around so we don't break pg_upgrade
>>> compatibility (and so people can continue to choose it if for some
>>> reason it works better in their use case). That kind of improvement
>>> is nothing special in a specific workload, but TOAST is a pretty
>>> general-purpose mechanism. I have become, through a few bitter
>>> experiences, a strong believer in the value of trying to reduce our
>>> on-disk footprint, and knocking 4% off the size of every TOAST
>>> table in the world does not sound worthless to me -- even though
>>> context-aware compression can doubtless do a lot better.
>>
>> +1. It's also a lot faster, and I've seen way too many workloads
>> with 50%+ time spent in pglz.
>>
>
> TBH the 4% figure is something I mostly made up (I'm fake news!). On the
> mailing list archive (which I believe is pretty compressible) I observed
> something like 2.5% size reduction with lz4 compared to pglz, at least
> with the compression levels I've used ...
Nikita Glukhov tested compression on real JSON data:
Delicious bookmarks (js):
json 1322MB
jsonb 1369MB
jsonbc 931MB 1.5x
jsonb+lz4d 404MB 3.4x
Citus customer reviews (jr):
json 1391MB
jsonb 1574MB
jsonbc 622MB 2.5x
jsonb+lz4d 601MB 2.5x
I've also attached a plot with the WiredTiger size (zstd compression) in MongoDB.
Nikita has more numbers about compression.
>
> Other algorithms (e.g. zstd) got significantly better compression (25%)
> compared to pglz, but in exchange for longer compression. I'm sure we
> could lower compression level to make it faster, but that will of course
> hurt the compression ratio.
>
> I don't think switching to a different compression algorithm is a way
> forward - it was proposed and explored repeatedly in the past, and every
> time it failed for a number of reasons, most of which are still valid.
>
>
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.
>
> Sure, there are workloads where pglz performs poorly (I've seen such
> cases too), but IMHO that's more an argument for the custom compression
> method approach. pglz gives you good default compression in most cases,
> and you can change it for columns where it matters, and where a
> different space/time trade-off makes sense.
>
>
> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.
I don't think so. Pluggable means that we now have more data types
which don't fit the old TOAST compression scheme, and we need better
flexibility. I can see that in the future we could avoid decompressing
the whole toasted value just to get one key from a document, by first
slicing the data and compressing each slice separately.
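As a hedged illustration of that slicing idea (hypothetical code, not from the patch; the slice size and structs are made up), compressing fixed-size slices independently lets a reader decompress just the slice it needs:

```c
/*
 * Hypothetical sketch of the slicing idea (not from the patch): compress
 * fixed-size slices independently so a single slice can be decompressed
 * without touching the rest of the datum.  Build: cc slices.c -llz4
 */
#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>

#define SLICE_SZ 2048

typedef struct SlicedDatum
{
    int     nslices;
    int    *csizes;     /* compressed size of each slice */
    char  **cdata;      /* each slice compressed on its own */
} SlicedDatum;

static SlicedDatum *
compress_sliced(const char *src, int len)
{
    SlicedDatum *sd = malloc(sizeof(SlicedDatum));

    sd->nslices = (len + SLICE_SZ - 1) / SLICE_SZ;
    sd->csizes = malloc(sd->nslices * sizeof(int));
    sd->cdata = malloc(sd->nslices * sizeof(char *));

    for (int i = 0; i < sd->nslices; i++)
    {
        int     insz = (i == sd->nslices - 1) ? len - i * SLICE_SZ : SLICE_SZ;
        int     bound = LZ4_compressBound(insz);

        sd->cdata[i] = malloc(bound);
        sd->csizes[i] = LZ4_compress_default(src + i * SLICE_SZ,
                                             sd->cdata[i], insz, bound);
    }
    return sd;
}

/* Decompress only the slice containing byte offset 'off'. */
static int
fetch_at(const SlicedDatum *sd, int off, char *out, int outsz)
{
    int     i = off / SLICE_SZ;

    return LZ4_decompress_safe(sd->cdata[i], out, sd->csizes[i], outsz);
}

int main(void)
{
    char    doc[8192], out[SLICE_SZ];

    for (int i = 0; i < (int) sizeof(doc); i++)
        doc[i] = "key:value;"[i % 10];

    SlicedDatum *sd = compress_sliced(doc, sizeof(doc));

    /* Reads one slice; the other three stay compressed (memory leaked: sketch). */
    printf("got %d bytes from slice %d\n",
           fetch_at(sd, 5000, out, sizeof(out)), 5000 / SLICE_SZ);
    return 0;
}
```

Only the slice holding the requested key is decompressed; the trade-off is a slightly worse ratio, since each slice is compressed without context from its neighbors.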
>
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
Attachment | Content-Type | Size
---|---|---
Screen Shot 2017-12-03 at 11.46.14.png | image/png | 183.0 KB
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-06 15:07:16 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Fri, 1 Dec 2017 21:47:43 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> +1 to do the rewrite, just like for other similar ALTER TABLE commands
Ok. What about the following syntax:
ALTER COLUMN DROP COMPRESSION - removes compression from the column
with a rewrite, and removes the related compression options, so the user
can drop the compression method.

ALTER COLUMN SET COMPRESSION NONE - for the cases when users just want
to disable compression for future tuples. After that they can keep the
compressed tuples, or if they have a large table they can decompress
tuples partially (using e.g. UPDATE) and then use ALTER COLUMN DROP
COMPRESSION, which will be much faster at that point.

ALTER COLUMN SET COMPRESSION <cm> WITH <cmoptions> - changes compression
for new tuples but does not touch old ones. If users want recompression
they can use the DROP/SET COMPRESSION combination.

I don't think that SET COMPRESSION with a rewrite of the whole table
will be useful enough on any somewhat big table, and at the same time
big tables are where the user needs compression the most.
I understand that ALTER with the rewrite sounds logical and much easier
to implement (and it doesn't require Oids in tuples), but it could be
unusable.
--
----
Regards,
Ildus Kurbangaliev
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-08 20:12:42 |
Message-ID: | CA+TgmobjDLp5xxodzN+TwhuhdQxM3wJw6AaLatDaBgPQm+N+kw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Dec 6, 2017 at 10:07 AM, Ildus Kurbangaliev
<i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> On Fri, 1 Dec 2017 21:47:43 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> +1 to do the rewrite, just like for other similar ALTER TABLE commands
>
> Ok. What about the following syntax:
>
> ALTER COLUMN DROP COMPRESSION - removes compression from the column
> with the rewrite and removes related compression options, so the user
> can drop compression method.
>
> ALTER COLUMN SET COMPRESSION NONE for the cases when
> the users want to just disable compression for future tuples. After
> that they can keep compressed tuples, or in the case when they have a
> large table they can decompress tuples partially using e.g. UPDATE,
> and then use ALTER COLUMN DROP COMPRESSION which will be much faster
> then.
>
> ALTER COLUMN SET COMPRESSION <cm> WITH <cmoptions> will change
> compression for new tuples but will not touch old ones. If the users
> want the recompression they can use DROP/SET COMPRESSION combination.
>
> I don't think that SET COMPRESSION with the rewrite of the whole table
> will be useful enough on any somewhat big tables and same time big
> tables is where the user needs compression the most.
>
> I understand that ALTER with the rewrite sounds logical and much easier
> to implement (and it doesn't require Oids in tuples), but it could be
> unusable.
The problem with this is that old compression methods can still be
floating around in the table even after you have done SET COMPRESSION
to something else. The table still needs to have a dependency on the
old compression method, because otherwise you might think it's safe to
drop the old one when it really is not. Furthermore, if you do a
pg_upgrade, you've got to preserve that dependency, which means it
would have to show up in a pg_dump --binary-upgrade someplace. It's
not obvious how any of that would work with this syntax.
Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
meaning that x1 is the default for new tuples but x2, x3, etc. are
still allowed if present. If you issue a command that only adds
things to the list, no table rewrite happens, but if you remove
anything, then it does.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-11 12:55:55 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Fri, 8 Dec 2017 15:12:42 -0500
Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
> meaning that x1 is the default for new tuples but x2, x3, etc. are
> still allowed if present. If you issue a command that only adds
> things to the list, no table rewrite happens, but if you remove
> anything, then it does.
>
I like this idea, but maybe it should be something like ALTER COLUMN
SET COMPRESSION x1 [ PRESERVE x2, x3 ]? 'PRESERVE' is already used in
our syntax, and this form shows more clearly which method is current and
which ones should be kept.
--
----
Regards,
Ildus Kurbangaliev
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-11 17:25:30 |
Message-ID: | CA+Tgmoa7U9Pgc8YV-eRkKAK2s74aL6Yw1Wsj81PTjyrn0Eg99g@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Dec 11, 2017 at 7:55 AM, Ildus Kurbangaliev
<i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> On Fri, 8 Dec 2017 15:12:42 -0500
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
>> meaning that x1 is the default for new tuples but x2, x3, etc. are
>> still allowed if present. If you issue a command that only adds
>> things to the list, no table rewrite happens, but if you remove
>> anything, then it does.
>
> I like this idea, but maybe it should be something like ALTER COLUMN
> SET COMPRESSION x1 [ PRESERVE x2, x3 ]? 'PRESERVE' is already used in
> our syntax, and this form shows more clearly which method is current and
> which ones should be kept.
Sure, that works. And I think pglz should exist in the catalog as a
predefined, undroppable compression algorithm. So the default for
each column initially is:
SET COMPRESSION pglz
And if you want to rewrite the table with your awesome custom thing, you can do
SET COMPRESSION awesome
But if you want to just use the awesome custom thing for new rows, you can do
SET COMPRESSION awesome PRESERVE pglz
Then we can get all the dependencies right, pg_upgrade works, users
have total control of rewrite behavior, and everything is great. :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-11 17:41:05 |
Message-ID: | CAPpHfdsO0wSy6ZEoOQS5AkxTRXeCzv4uz=S8hOOJNgOxPmJW4A@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Dec 11, 2017 at 8:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Dec 11, 2017 at 7:55 AM, Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> > On Fri, 8 Dec 2017 15:12:42 -0500
> > Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
> >> meaning that x1 is the default for new tuples but x2, x3, etc. are
> >> still allowed if present. If you issue a command that only adds
> >> things to the list, no table rewrite happens, but if you remove
> >> anything, then it does.
> >
> > I like this idea, but maybe it should be something like ALTER COLUMN
> > SET COMPRESSION x1 [ PRESERVE x2, x3 ]? 'PRESERVE' is already used in
> > our syntax, and this form shows more clearly which method is current and
> > which ones should be kept.
>
> Sure, that works. And I think pglz should exist in the catalog as a
> predefined, undroppable compression algorithm. So the default for
> each column initially is:
>
> SET COMPRESSION pglz
>
> And if you want to rewrite the table with your awesome custom thing, you
> can do
>
> SET COMPRESSION awesome
>
> But if you want to just use the awesome custom thing for new rows, you can
> do
>
> SET COMPRESSION awesome PRESERVE pglz
>
> Then we can get all the dependencies right, pg_upgrade works, users
> have total control of rewrite behavior, and everything is great. :-)
>
Looks good.
Thus, in your example, if a user would like to further change awesome
compression to evenbetter compression, she should write:

SET COMPRESSION evenbetter PRESERVE pglz, awesome; -- full list of previous compression methods

I wonder what we should do if the user specifies only part of the
previous compression methods? For instance, pglz is specified but
awesome is missing:

SET COMPRESSION evenbetter PRESERVE pglz; -- awesome is missing

I think we should trigger an error in this case, because the query is
written in a form that assumes no table rewrite, but we're unable to
satisfy it without one.

I also think we need some way to change the compression method for
multiple columns in a single table rewrite, because that would be way
more efficient than rewriting the table for each column. So as an
alternative to

ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome; -- first table rewrite
ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome; -- second table rewrite

we could also provide

ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome PRESERVE pglz; -- no rewrite
ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome PRESERVE pglz; -- no rewrite
VACUUM FULL tbl RESET COMPRESSION PRESERVE c1, c2; -- rewrite with recompression of c1 and c2 and removing dependencies

?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-11 17:46:20 |
Message-ID: | CA+TgmoYqG2DFHDEM9ixMQA0p9d=_w6pYaa-+Jue4PP56VNSCWQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Dec 11, 2017 at 12:41 PM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> Thus, in your example if user would like to further change awesome
> compression for evenbetter compression, she should write.
>
> SET COMPRESSION evenbetter PRESERVE pglz, awesome; -- full list of previous
> compression methods
Right.
> I wonder what should we do if user specifies only part of previous
> compression methods? For instance, pglz is specified but awesome is
> missing.
>
> SET COMPRESSION evenbetter PRESERVE pglz; -- awesome is missing
>
> I think we should trigger an error in this case. Because query is specified
> in form that is assuming to work without table rewrite, but we're unable to
> do this without table rewrite.
I think that should just rewrite the table in that case. PRESERVE
should specify the things that are allowed to be preserved -- its mere
presence should not be read to preclude a rewrite. And it's
completely reasonable for someone to want to do this, if they are
thinking about de-installing awesome.
> I also think that we need some way to change compression method for multiple
> columns in a single table rewrite. Because it would be way more efficient
> than rewriting table for each of columns. So as an alternative of
>
> ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome; -- first table
> rewrite
> ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome; -- second table
> rewrite
>
> we could also provide
>
> ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome PRESERVE pglz; -- no
> rewrite
> ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome PRESERVE pglz; -- no
> rewrite
> VACUUM FULL tbl RESET COMPRESSION PRESERVE c1, c2; -- rewrite with
> recompression of c1 and c2 and removing dependencies
>
> ?
Hmm. ALTER TABLE allows multiple comma-separated subcommands, so I don't
think we need to drag VACUUM into this. The user can just say:
ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome, ALTER COLUMN
c2 SET COMPRESSION awesome;
If this is properly integrated into tablecmds.c, that should cause a
single rewrite affecting both columns.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-11 18:06:14 |
Message-ID: | CAPpHfduP77qxdETcPRc_FGOVJpeSCwNGoSUgNU3BCHXu57RVMA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Dec 11, 2017 at 8:46 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Dec 11, 2017 at 12:41 PM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> > Thus, in your example if user would like to further change awesome
> > compression for evenbetter compression, she should write.
> >
> > SET COMPRESSION evenbetter PRESERVE pglz, awesome; -- full list of
> previous
> > compression methods
>
> Right.
>
> > I wonder what should we do if user specifies only part of previous
> > compression methods? For instance, pglz is specified but awesome is
> > missing.
> >
> > SET COMPRESSION evenbetter PRESERVE pglz; -- awesome is missing
> >
> > I think we should trigger an error in this case. Because query is
> specified
> > in form that is assuming to work without table rewrite, but we're unable
> to
> > do this without table rewrite.
>
> I think that should just rewrite the table in that case. PRESERVE
> should specify the things that are allowed to be preserved -- its mere
> presence should not be read to preclude a rewrite. And it's
> completely reasonable for someone to want to do this, if they are
> thinking about de-installing awesome.
>
OK, but a NOTICE that a presumably unexpected table rewrite is taking
place could still be useful.
Also, we should probably add some view exposing the compression methods
that are currently preserved for columns, so that users can correctly
construct a SET COMPRESSION query that doesn't rewrite the table,
without digging into internals (like directly querying pg_depend).
> I also think that we need some way to change compression method for
> multiple
> > columns in a single table rewrite. Because it would be way more
> efficient
> > than rewriting table for each of columns. So as an alternative of
> >
> > ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome; -- first table
> > rewrite
> > ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome; -- second table
> > rewrite
> >
> > we could also provide
> >
> > ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome PRESERVE pglz;
> -- no
> > rewrite
> > ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome PRESERVE pglz;
> -- no
> > rewrite
> > VACUUM FULL tbl RESET COMPRESSION PRESERVE c1, c2; -- rewrite with
> > recompression of c1 and c2 and removing depedencies
> >
> > ?
>
> Hmm. ALTER TABLE allows multi comma-separated subcommands, so I don't
> think we need to drag VACUUM into this. The user can just say:
>
> ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome, ALTER COLUMN
> c2 SET COMPRESSION awesome;
>
> If this is properly integrated into tablecmds.c, that should cause a
> single rewrite affecting both columns.
OK. Sorry, I didn't notice we can use multiple subcommands for ALTER TABLE
in this case...
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-11 19:53:29 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi,
I see there's an ongoing discussion about the syntax and the ALTER TABLE
behavior when changing a compression method for a column. So the patch
seems to be on its way to being ready in the January CF, I guess.
But let me play the devil's advocate for a while and question the
usefulness of this approach to compression. Some of the questions were
mentioned in the thread before, but I don't think they got the attention
they deserve.
FWIW I don't know the answers, but I think it's important to ask them.
Also, apologies if this post looks to be against the patch - that's part
of the "devil's advocate" thing.
The main question I'm asking myself is what use cases the patch
addresses, and whether there is a better way to do that. I see about
three main use-cases:
1) Replacing the algorithm used to compress all varlena types (in a way
that makes it transparent for the data type code).
2) Custom datatype-aware compression (e.g. the tsvector).
3) Custom datatype-aware compression with additional column-specific
metadata (e.g. the jsonb with external dictionary).
Now, let's discuss those use cases one by one, and see if there are
simpler (or better in some way) solutions ...
Replacing the algorithm used to compress all varlena values (in a way
that makes it transparent for the data type code).
----------------------------------------------------------------------
While pglz served us well over time, it was repeatedly mentioned that in
some cases it becomes the bottleneck. So supporting other state of the
art compression algorithms seems like a good idea, and this patch is one
way to do that.
But perhaps we should simply make it an initdb option (in which case the
whole cluster would simply use e.g. lz4 instead of pglz)?
That seems like a much simpler approach - it would only require some
./configure options to add --with-lz4 (and other compression libraries),
an initdb option to pick compression algorithm, and probably noting the
choice in cluster controldata.
No dependency tracking, no ALTER TABLE issues, etc.
Of course, it would not allow using different compression algorithms for
different columns (although it might perhaps allow different compression
level, to some extent).
Conclusion: If we want to offer a simple cluster-wide pglz alternative,
perhaps this patch is not the right way to do that.
Custom datatype-aware compression (e.g. the tsvector)
----------------------------------------------------------------------
Exploiting knowledge of the internal data type structure is a promising
way to improve compression ratio and/or performance.
The obvious question of course is why shouldn't this be done by the data
type code directly, which would also allow additional benefits like
operating directly on the compressed values.
Another thing is that if the datatype representation changes in some
way, the compression method has to change too. So it's tightly coupled
to the datatype anyway.
This does not really require any new infrastructure, all the pieces are
already there.
In some cases that may not be quite possible - the datatype may not be
flexible enough to support alternative (compressed) representation, e.g.
because there are no bits available for "compressed" flag, etc.
Conclusion: IMHO if we want to exploit the knowledge of the data type
internal structure, perhaps doing that in the datatype code directly
would be a better choice.
Custom datatype-aware compression with additional column-specific
metadata (e.g. the jsonb with external dictionary).
----------------------------------------------------------------------
Exploiting redundancy in multiple values in the same column (instead of
compressing them independently) is another attractive way to help the
compression. It is inherently datatype-aware, but currently can't be
implemented directly in datatype code as there's no concept of
column-specific storage (e.g. to store dictionary shared by all values
in a particular column).
I believe any patch addressing this use case would have to introduce
such column-specific storage, and any solution doing that would probably
need to introduce the same catalogs, etc.
The obvious disadvantage of course is that we need to decompress the
varlena value before doing pretty much anything with it, because the
datatype is not aware of the compression.
So I wonder if the patch should instead provide infrastructure for doing
that in the datatype code directly.
The other question is if the patch should introduce some infrastructure
for handling the column context (e.g. column dictionary). Right now,
whoever implements the compression has to implement this bit too.
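To make that last point concrete, here is a purely hypothetical sketch (not the patch's actual API; the struct and callback names are made up) of a handler whose configure callback builds per-column context, such as a dictionary, that the compress/decompress callbacks then receive. Today each compression method author has to invent this part themselves:

```c
/*
 * Purely hypothetical sketch, not the patch's actual API: a compression
 * handler whose configure callback builds per-column context (such as a
 * shared dictionary) that the compress/decompress callbacks then receive.
 */
typedef struct ColumnCompressor
{
    /* build column-specific state from the method's options, e.g. a dictionary */
    void   *(*configure) (const char *options);
    /* release that state when compression is removed from the column */
    void    (*drop) (void *ctx);
    /* both calls get the column context along with the raw bytes */
    int     (*compress) (void *ctx, const char *src, int srclen,
                         char *dst, int dstcap);
    int     (*decompress) (void *ctx, const char *src, int srclen,
                           char *dst, int dstcap);
} ColumnCompressor;
```

Shared infrastructure for building and storing such contexts is exactly the piece that is currently left to each handler.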
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-12 11:41:42 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Mon, 11 Dec 2017 20:53:29 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> But let me play the devil's advocate for a while and question the
> usefulness of this approach to compression. Some of the questions were
> mentioned in the thread before, but I don't think they got the
> attention they deserve.
Hi. I will try to explain why this approach could be better than others.
>
>
> Replacing the algorithm used to compress all varlena values (in a way
> that makes it transparent for the data type code).
> ----------------------------------------------------------------------
>
> While pglz served us well over time, it was repeatedly mentioned that
> in some cases it becomes the bottleneck. So supporting other state of
> the art compression algorithms seems like a good idea, and this patch
> is one way to do that.
>
> But perhaps we should simply make it an initdb option (in which case
> the whole cluster would simply use e.g. lz4 instead of pglz)?
>
> That seems like a much simpler approach - it would only require some
> ./configure options to add --with-lz4 (and other compression
> libraries), an initdb option to pick compression algorithm, and
> probably noting the choice in cluster controldata.
Replacing pglz for all varlena values wasn't the goal of the patch, but
it's possible to do with it, and I think that's good. And as Robert
mentioned, pglz could appear as a builtin undroppable compression method,
so the others could be added too. And in the future it can open the way
to specifying compression for a specific database or cluster.
>
> Custom datatype-aware compression (e.g. the tsvector)
> ----------------------------------------------------------------------
>
> Exploiting knowledge of the internal data type structure is a
> promising way to improve compression ratio and/or performance.
>
> The obvious question of course is why shouldn't this be done by the
> data type code directly, which would also allow additional benefits
> like operating directly on the compressed values.
>
> Another thing is that if the datatype representation changes in some
> way, the compression method has to change too. So it's tightly coupled
> to the datatype anyway.
>
> This does not really require any new infrastructure, all the pieces
> are already there.
>
> In some cases that may not be quite possible - the datatype may not be
> flexible enough to support alternative (compressed) representation,
> e.g. because there are no bits available for "compressed" flag, etc.
>
> Conclusion: IMHO if we want to exploit the knowledge of the data type
> internal structure, perhaps doing that in the datatype code directly
> would be a better choice.
It could be, but let's imagine there is internal compression for
tsvector. It means that tsvector now has two formats and loses one bit
somewhere in the header. After a while we find a better compression,
but we can't add it because there is already one, and it's not good to
have three different formats for one type. Or, suppose the compression
methods were implemented and we decided to use dictionaries for tsvector
(if the user is going to store a limited number of words): it would mean
that tsvector goes through two compression stages (one for its internal
format and one for the dictionaries).
>
>
> Custom datatype-aware compression with additional column-specific
> metadata (e.g. the jsonb with external dictionary).
> ----------------------------------------------------------------------
>
> Exploiting redundancy in multiple values in the same column (instead
> of compressing them independently) is another attractive way to help
> the compression. It is inherently datatype-aware, but currently can't
> be implemented directly in datatype code as there's no concept of
> column-specific storage (e.g. to store dictionary shared by all values
> in a particular column).
>
> I believe any patch addressing this use case would have to introduce
> such column-specific storage, and any solution doing that would
> probably need to introduce the same catalogs, etc.
>
> The obvious disadvantage of course is that we need to decompress the
> varlena value before doing pretty much anything with it, because the
> datatype is not aware of the compression.
>
> So I wonder if the patch should instead provide infrastructure for
> doing that in the datatype code directly.
>
> The other question is if the patch should introduce some
> infrastructure for handling the column context (e.g. column
> dictionary). Right now, whoever implements the compression has to
> implement this bit too.
Column-specific storage sounds optional to me. For example, compressing
timestamp[] using some delta compression will not require it.
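For example, a minimal sketch of such delta compression (hypothetical code, not from the patch): consecutive int64 microsecond timestamps become small deltas, and zig-zag plus varint encoding packs each one into a few bytes, with no column-wide state needed:

```c
/*
 * Hypothetical illustration (not from the patch) of delta compression for
 * timestamp[]: consecutive int64 microsecond timestamps become small
 * deltas, packed by zig-zag + varint encoding into a few bytes each.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Zig-zag maps small signed deltas to small unsigned values. */
static uint64_t
zigzag(int64_t v)
{
    return ((uint64_t) v << 1) ^ (uint64_t) (v >> 63);
}

/* Varint-encode one value, 7 bits per byte; returns bytes written. */
static size_t
put_varint(uint64_t v, unsigned char *out)
{
    size_t  n = 0;

    while (v >= 0x80)
    {
        out[n++] = (unsigned char) (v | 0x80);
        v >>= 7;
    }
    out[n++] = (unsigned char) v;
    return n;
}

/* Delta-encode an array of timestamps (microseconds since epoch). */
static size_t
delta_compress(const int64_t *ts, size_t n, unsigned char *out)
{
    size_t  pos = 0;
    int64_t prev = 0;

    for (size_t i = 0; i < n; i++)
    {
        pos += put_varint(zigzag(ts[i] - prev), out + pos);
        prev = ts[i];
    }
    return pos;
}

int main(void)
{
    int64_t ts[] = {1512417600000000, 1512417601000000,
                    1512417602000000, 1512417603500000};
    unsigned char buf[64];

    printf("%zu raw bytes -> %zu encoded bytes\n",
           sizeof(ts), delta_compress(ts, 4, buf));
    return 0;
}
```

A decoder just reverses the two steps and re-accumulates the running sum.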
--
----
Regards,
Ildus Kurbangaliev
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-12 15:07:11 |
Message-ID: | CAPpHfdumK-HDS-w6Zm7zXDFKJe+xmjKonda_22GbrBrqnrm3TA@mail.gmail.com |
Lists: | pgsql-hackers |
Hi!
Let me add my two cents too.
On Tue, Dec 12, 2017 at 2:41 PM, Ildus Kurbangaliev <
i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> On Mon, 11 Dec 2017 20:53:29 +0100 Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > Replacing the algorithm used to compress all varlena values (in a way
> > that makes it transparent for the data type code).
> > ----------------------------------------------------------------------
> >
> > While pglz served us well over time, it was repeatedly mentioned that
> > in some cases it becomes the bottleneck. So supporting other state of
> > the art compression algorithms seems like a good idea, and this patch
> > is one way to do that.
> >
> > But perhaps we should simply make it an initdb option (in which case
> > the whole cluster would simply use e.g. lz4 instead of pglz)?
> >
> > That seems like a much simpler approach - it would only require some
> > ./configure options to add --with-lz4 (and other compression
> > libraries), an initdb option to pick compression algorithm, and
> > probably noting the choice in cluster controldata.
>
> Replacing pglz for all varlena values wasn't the goal of the patch, but
> it's possible to do with it and I think that's good. And as Robert
> mentioned pglz could appear as builtin undroppable compresssion method
> so the others could be added too. And in the future it can open the
> ways to specify compression for specific database or cluster.
>
Yes, using custom compression methods to replace the generic
non-type-specific compression method is really not the primary goal of
this patch. However, I would consider that a useful side effect. Even
in this case I see some advantages of custom compression methods over
an initdb option.
1) In order to support alternative compression methods in initdb, we have
to provide builtin support for them. Then we immediately run into the
dependencies/incompatible-licenses problem. Also, we tie the appearance of
new compression methods to our release cycle. In real life, flexibility
means a lot. Giving users the freedom to experiment with various
compression libraries without having to recompile PostgreSQL core is great.
2) Users won't necessarily be satisfied with applying a single
compression method to the whole database cluster. Various columns may have
different data distributions with different workloads. The optimal
compression type for one column is not necessarily optimal for another column.
3) The possibility to change the compression method on the fly without
re-initdb is very good too.
> Custom datatype-aware compression (e.g. the tsvector)
> > ----------------------------------------------------------------------
> >
> > Exploiting knowledge of the internal data type structure is a
> > promising way to improve compression ratio and/or performance.
> >
> > The obvious question of course is why shouldn't this be done by the
> > data type code directly, which would also allow additional benefits
> > like operating directly on the compressed values.
> >
> > Another thing is that if the datatype representation changes in some
> > way, the compression method has to change too. So it's tightly coupled
> > to the datatype anyway.
> >
> > This does not really require any new infrastructure, all the pieces
> > are already there.
> >
> > In some cases that may not be quite possible - the datatype may not be
> > flexible enough to support alternative (compressed) representation,
> > e.g. because there are no bits available for "compressed" flag, etc.
> >
> > Conclusion: IMHO if we want to exploit the knowledge of the data type
> > internal structure, perhaps doing that in the datatype code directly
> > would be a better choice.
>
> It could be, but let's imagine there will be internal compression for
> tsvector. It means that tsvector has two formats now and minus one bit
> somewhere in the header. After a while we found a better compression
> but we can't add it because there is already one and it's not good to
> have three different formats for one type. Or, the compression methods
> were implemented and we decided to use dictionaries for tsvector (if
> the user going to store limited number of words). But it will mean that
> tsvector will go two compression stages (for its internal and for
> dictionaries).
I would like to add that even for a single datatype, various compression
methods may have different tradeoffs. For instance, one compression method
can have a better compression ratio, while another one has faster
decompression. And it's OK for the user to choose different compression
methods for different columns.
Making extensions depend on a datatype's internal representation doesn't
seem evil to me. We already have a bunch of extensions depending on much
deeper guts of PostgreSQL. On a major release of PostgreSQL, extensions
must adapt to the changes; that is the rule. And note that the datatype
internal representation changes relatively rarely in comparison with other
internals, because it's related to the on-disk format and the ability to
pg_upgrade.
> Custom datatype-aware compression with additional column-specific
> > metadata (e.g. the jsonb with external dictionary).
> > ----------------------------------------------------------------------
> >
> > Exploiting redundancy in multiple values in the same column (instead
> > of compressing them independently) is another attractive way to help
> > the compression. It is inherently datatype-aware, but currently can't
> > be implemented directly in datatype code as there's no concept of
> > column-specific storage (e.g. to store dictionary shared by all values
> > in a particular column).
> >
> > I believe any patch addressing this use case would have to introduce
> > such column-specific storage, and any solution doing that would
> > probably need to introduce the same catalogs, etc.
> >
> > The obvious disadvantage of course is that we need to decompress the
> > varlena value before doing pretty much anything with it, because the
> > datatype is not aware of the compression.
> >
> > So I wonder if the patch should instead provide infrastructure for
> > doing that in the datatype code directly.
> >
> > The other question is if the patch should introduce some
> > infrastructure for handling the column context (e.g. column
> > dictionary). Right now, whoever implements the compression has to
> > implement this bit too.
>
> Column specific storage sounds optional to me. For example compressing
> timestamp[] using some delta compression will not require it.
It could also be useful to have a custom compression method with a fixed
(not dynamically extended) dictionary. See [1] for an example of what
other databases do. We may specify the fixed dictionary directly in the
compression method options; I see no problems there. We may also compress
that way not only jsonb or other special data types, but also natural
language texts. Using fixed dictionaries for natural language, we can
effectively compress short texts, where lz and other generic compression
methods don't have enough information to effectively build a per-value
dictionary.
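As a sketch of how that might look with libzstd (a hypothetical example, not from the patch; the dictionary contents and the idea of passing it via method options are made up), a fixed raw-content dictionary helps short values that share its vocabulary:

```c
/*
 * Hypothetical sketch (not from the patch; dictionary contents made up):
 * a fixed raw-content dictionary with libzstd helps short values that
 * share its vocabulary.  Build: cc dict_demo.c -lzstd
 */
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    /* Such a dictionary could be passed in the compression method options. */
    const char  dict[] = "\"name\":\"email\":\"address\":\"phone\":example.org";
    const char  src[] = "{\"name\":\"alice\",\"email\":\"alice@example.org\"}";
    char        dst[256];
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      csize = ZSTD_compress_usingDict(cctx, dst, sizeof(dst),
                                                src, strlen(src),
                                                dict, strlen(dict), 3);

    if (ZSTD_isError(csize))
        return 1;
    printf("%zu bytes -> %zu bytes with a fixed dictionary\n",
           strlen(src), csize);
    ZSTD_freeCCtx(cctx);
    return 0;
}
```

With values this short, a generic compressor has almost nothing to work with on its own, which is the point of shipping the dictionary out of band.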
For sure, further work to improve the infrastructure is required,
including per-column storage for dictionaries and tighter integration
between compression method and datatype. However, we typically deal with
such complex tasks in a step-by-step approach, and I'm not convinced that
custom compression methods are bad as the first step in this direction.
To me they look clear and already very useful in this shape.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Oleg Bartunov <obartunov(at)gmail(dot)com> |
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-12 20:08:04 |
Message-ID: | CAF4Au4xop7FqhCKgabYWymUS0yUk9i=bonPnmVUBbpoKsFYnLA@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Dec 12, 2017 at 6:07 PM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> Hi!
>
> Let me add my two cents too.
>
> On Tue, Dec 12, 2017 at 2:41 PM, Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>>
>> On Mon, 11 Dec 2017 20:53:29 +0100 Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> > Replacing the algorithm used to compress all varlena values (in a way
>> > that makes it transparent for the data type code).
>> > ----------------------------------------------------------------------
>> >
>> > While pglz served us well over time, it was repeatedly mentioned that
>> > in some cases it becomes the bottleneck. So supporting other state of
>> > the art compression algorithms seems like a good idea, and this patch
>> > is one way to do that.
>> >
>> > But perhaps we should simply make it an initdb option (in which case
>> > the whole cluster would simply use e.g. lz4 instead of pglz)?
>> >
>> > That seems like a much simpler approach - it would only require some
>> > ./configure options to add --with-lz4 (and other compression
>> > libraries), an initdb option to pick compression algorithm, and
>> > probably noting the choice in cluster controldata.
>>
>> Replacing pglz for all varlena values wasn't the goal of the patch, but
>> it's possible to do with it and I think that's good. And as Robert
>> mentioned pglz could appear as builtin undroppable compression method
>> so the others could be added too. And in the future it can open the
>> ways to specify compression for specific database or cluster.
>
>
> Yes, usage of custom compression methods to replace generic non
> type-specific compression method is really not the primary goal of this
> patch. However, I would consider that as useful side effect. However, even
> in this case I see some advantages of custom compression methods over initdb
> option.
>
> 1) In order to support alternative compression methods in initdb, we have to
> provide builtin support for them. Then we immediately run into
> dependencies/incompatible-licenses problem. Also, we tie appearance of new
> compression methods to our release cycle. In real life, flexibility means a
> lot. Giving users freedom to experiment with various compression libraries
> without having to recompile PostgreSQL core is great.
> 2) It's not necessary that users would be satisfied with applying single
> compression method to the whole database cluster. Various columns may have
> different data distributions with different workloads. Optimal compression
> type for one column is not necessarily optimal for another column.
> 3) Possibility to change compression method on the fly without re-initdb is
> very good too.
I consider custom compression as the way to custom TOAST. For example,
for optimal access to parts of a very long document we need to compress
slices of the document. Currently, a long jsonb document gets compressed
and then sliced, and that kills all the benefits of binary jsonb. Also,
we are thinking about "lazy" access to parts of jsonb from PLs, which is
currently awfully inefficient.
>
>> > Custom datatype-aware compression (e.g. the tsvector)
>> > ----------------------------------------------------------------------
>> >
>> > Exploiting knowledge of the internal data type structure is a
>> > promising way to improve compression ratio and/or performance.
>> >
>> > The obvious question of course is why shouldn't this be done by the
>> > data type code directly, which would also allow additional benefits
>> > like operating directly on the compressed values.
>> >
>> > Another thing is that if the datatype representation changes in some
>> > way, the compression method has to change too. So it's tightly coupled
>> > to the datatype anyway.
>> >
>> > This does not really require any new infrastructure, all the pieces
>> > are already there.
>> >
>> > In some cases that may not be quite possible - the datatype may not be
>> > flexible enough to support alternative (compressed) representation,
>> > e.g. because there are no bits available for "compressed" flag, etc.
>> >
>> > Conclusion: IMHO if we want to exploit the knowledge of the data type
>> > internal structure, perhaps doing that in the datatype code directly
>> > would be a better choice.
>>
>> It could be, but let's imagine there will be internal compression for
>> tsvector. It means that tsvector has two formats now and minus one bit
>> somewhere in the header. After a while we found a better compression
>> but we can't add it because there is already one and it's not good to
>> have three different formats for one type. Or, the compression methods
>> were implemented and we decided to use dictionaries for tsvector (if
>> the user is going to store a limited number of words). But it will mean that
>> tsvector will go through two compression stages (for its internal and for
>> dictionaries).
>
>
> I would like to add that even for single datatype various compression
> methods may have different tradeoffs. For instance, one compression method
> can have better compression ratio, but another one has faster
> decompression. And it's OK for user to choose different compression methods
> for different columns.
>
> Depending extensions on datatype internal representation doesn't seem evil
> for me. We already have bunch of extension depending on much more deeper
> guts of PostgreSQL. On major release of PostgreSQL, extensions must adopt
> the changes, that is the rule. And note, the datatype internal
> representation alters relatively rare in comparison with other internals,
> because it's related to on-disk format and ability to pg_upgrade.
>
>> > Custom datatype-aware compression with additional column-specific
>> > metadata (e.g. the jsonb with external dictionary).
>> > ----------------------------------------------------------------------
>> >
>> > Exploiting redundancy in multiple values in the same column (instead
>> > of compressing them independently) is another attractive way to help
>> > the compression. It is inherently datatype-aware, but currently can't
>> > be implemented directly in datatype code as there's no concept of
>> > column-specific storage (e.g. to store dictionary shared by all values
>> > in a particular column).
>> >
>> > I believe any patch addressing this use case would have to introduce
>> > such column-specific storage, and any solution doing that would
>> > probably need to introduce the same catalogs, etc.
>> >
>> > The obvious disadvantage of course is that we need to decompress the
>> > varlena value before doing pretty much anything with it, because the
>> > datatype is not aware of the compression.
>> >
>> > So I wonder if the patch should instead provide infrastructure for
>> > doing that in the datatype code directly.
>> >
>> > The other question is if the patch should introduce some
>> > infrastructure for handling the column context (e.g. column
>> > dictionary). Right now, whoever implements the compression has to
>> > implement this bit too.
>>
>> Column specific storage sounds optional to me. For example compressing
>> timestamp[] using some delta compression will not require it.
>
>
> It could also be useful to have a custom compression method with a fixed
> (not dynamically extended) dictionary. See [1] for an example of what
> other databases do. We could specify a fixed dictionary directly in the
> compression method options; I see no problem with that. We could compress
> that way not only jsonb and other special data types, but also natural
> language text. Using fixed dictionaries for natural language, we can
> effectively compress short texts, where lz and other generic compression
> methods don't have enough information to train a per-value dictionary.
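Concretely, that might look something like this (a sketch only: the
"dictcomp" method and its dictionary option are made up, using the SET
COMPRESSION syntax discussed later in this thread):

ALTER TABLE messages ALTER COLUMN body
    SET COMPRESSION dictcomp WITH (dictionary = 'the,and,that,with')
    PRESERVE pglz;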
>
> For sure, further work to improve the infrastructure is required,
> including per-column storage for dictionaries and tighter integration
> between compression methods and datatypes. However, we typically deal with
> such complex tasks in a step-by-step fashion. And I'm not convinced that
> custom compression methods are a bad first step in this direction. To me
> they look clean and already very useful in this shape.
+1
>
> 1.
> https://www.percona.com/doc/percona-server/LATEST/flexibility/compressed_columns.html
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-12 20:52:01 |
Message-ID: | CA+TgmoZqJYP+othk0c-r6acjVWWvBoxkWm5eHLUE=8VyezQKKw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 11, 2017 at 1:06 PM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> OK, but a NOTICE that a presumably unexpected table rewrite is taking
> place could still be useful.
I'm not going to complain too much about that, but I think that's
mostly a failure of expectation rather than a real problem. If the
documentation says what the user should expect, and they expect
something else, tough luck for them.
> Also, we should probably add some view exposing the compression methods
> currently preserved for columns, so that users can correctly construct a
> SET COMPRESSION query that doesn't rewrite the table without digging into
> internals (like directly querying pg_depend).
Yes. I wonder if \d or \d+ can show it somehow.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-12 21:33:48 |
Message-ID: | CA+Tgmoax3Hz3ZLupxXofurFbpgKBEwe8qd7N65C7G6xoE=xQTw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 11, 2017 at 2:53 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> But let me play the devil's advocate for a while and question the
> usefulness of this approach to compression. Some of the questions were
> mentioned in the thread before, but I don't think they got the attention
> they deserve.
Sure, thanks for chiming in. I think it is good to make sure we are
discussing this stuff.
> But perhaps we should simply make it an initdb option (in which case the
> whole cluster would simply use e.g. lz4 instead of pglz)?
>
> That seems like a much simpler approach - it would only require some
> ./configure options to add --with-lz4 (and other compression libraries),
> an initdb option to pick compression algorithm, and probably noting the
> choice in cluster controldata.
>
> No dependencies tracking, no ALTER TABLE issues, etc.
>
> Of course, it would not allow using different compression algorithms for
> different columns (although it might perhaps allow different compression
> level, to some extent).
>
> Conclusion: If we want to offer a simple cluster-wide pglz alternative,
> perhaps this patch is not the right way to do that.
I actually disagree with your conclusion here. I mean, if you do it
that way, then it has the same problem as checksums: changing
compression algorithms requires a full dump-and-reload of the
database, which makes it more or less a non-starter for large
databases. On the other hand, with the infrastructure provided by
this patch, we can have a default_compression_method GUC that will be
set to 'pglz' initially. If the user changes it to 'lz4', or we ship
a new release where the new default is 'lz4', then new tables created
will use that new setting, but the existing stuff keeps working. If
you want to upgrade your existing tables to use lz4 rather than pglz,
you can change the compression option for those columns to COMPRESS
lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS
lz4 to force a rewrite of an individual table. That's really
powerful, and I think users will like it a lot.
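To make that concrete, here is a sketch of the workflow described above
(the GUC name and the exact SET COMPRESSION/PRESERVE spelling are still
being settled in this thread, so treat this as illustrative only):

-- new tables pick up the default method
SET default_compression_method = 'lz4';
-- incremental: new values use lz4, old pglz values remain readable
ALTER TABLE t ALTER COLUMN a SET COMPRESSION lz4 PRESERVE pglz;
-- full migration: omit PRESERVE to force a rewrite of the table
ALTER TABLE t ALTER COLUMN a SET COMPRESSION lz4;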
In short, your approach, while perhaps a little simpler to code, seems
like it is fraught with operational problems which this design avoids.
> Custom datatype-aware compression (e.g. the tsvector)
> ----------------------------------------------------------------------
>
> Exploiting knowledge of the internal data type structure is a promising
> way to improve compression ratio and/or performance.
>
> The obvious question of course is why shouldn't this be done by the data
> type code directly, which would also allow additional benefits like
> operating directly on the compressed values.
>
> Another thing is that if the datatype representation changes in some
> way, the compression method has to change too. So it's tightly coupled
> to the datatype anyway.
>
> This does not really require any new infrastructure, all the pieces are
> already there.
>
> In some cases that may not be quite possible - the datatype may not be
> flexible enough to support alternative (compressed) representation, e.g.
> because there are no bits available for "compressed" flag, etc.
>
> Conclusion: IMHO if we want to exploit the knowledge of the data type
> internal structure, perhaps doing that in the datatype code directly
> would be a better choice.
I definitely think there's a place for compression built right into
the data type. I'm still happy about commit
145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
needs to be done there. But that type of improvement and what is
proposed here are basically orthogonal. Having either one is good;
having both is better.
I think there may also be a place for declaring that a particular data
type has a "privileged" type of TOAST compression; if you use that
kind of compression for that data type, the data type will do smart
things, and if not, it will have to decompress in more cases. But I
think this infrastructure makes that kind of thing easier, not harder.
> Custom datatype-aware compression with additional column-specific
> metadata (e.g. the jsonb with external dictionary).
> ----------------------------------------------------------------------
>
> Exploiting redundancy in multiple values in the same column (instead of
> compressing them independently) is another attractive way to help the
> compression. It is inherently datatype-aware, but currently can't be
> implemented directly in datatype code as there's no concept of
> column-specific storage (e.g. to store dictionary shared by all values
> in a particular column).
>
> I believe any patch addressing this use case would have to introduce
> such column-specific storage, and any solution doing that would probably
> need to introduce the same catalogs, etc.
>
> The obvious disadvantage of course is that we need to decompress the
> varlena value before doing pretty much anything with it, because the
> datatype is not aware of the compression.
>
> So I wonder if the patch should instead provide infrastructure for doing
> that in the datatype code directly.
>
> The other question is if the patch should introduce some infrastructure
> for handling the column context (e.g. column dictionary). Right now,
> whoever implements the compression has to implement this bit too.
I agree that having a place to store a per-column compression
dictionary would be awesome, but I think that could be added later on
top of this infrastructure. For example, suppose we stored each
per-column compression dictionary in a separate file and provided some
infrastructure for WAL-logging changes to the file on a logical basis
and checkpointing those updates. Then we wouldn't be tied to the
MVCC/transactional issues which storing the blobs in a table would
have, which seems like a big win. Of course, it also creates a lot of
little tiny files inside a directory that already tends to have too
many files, but maybe with some more work we can figure out a way
around that problem. Here again, it seems to me that the proposed
design is going more in the right direction than the wrong direction:
if some day we have per-column dictionaries, they will need to be tied
to specific compression methods on specific columns. If we already
have that concept, extending it to do something new is easier than if
we have to create it from scratch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-12 22:07:46 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 12/12/2017 10:33 PM, Robert Haas wrote:
> On Mon, Dec 11, 2017 at 2:53 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> But let me play the devil's advocate for a while and question the
>> usefulness of this approach to compression. Some of the questions were
>> mentioned in the thread before, but I don't think they got the attention
>> they deserve.
>
> Sure, thanks for chiming in. I think it is good to make sure we are
> discussing this stuff.
>
>> But perhaps we should simply make it an initdb option (in which case the
>> whole cluster would simply use e.g. lz4 instead of pglz)?
>>
>> That seems like a much simpler approach - it would only require some
>> ./configure options to add --with-lz4 (and other compression libraries),
>> an initdb option to pick compression algorithm, and probably noting the
>> choice in cluster controldata.
>>
>> No dependencies tracking, no ALTER TABLE issues, etc.
>>
>> Of course, it would not allow using different compression algorithms for
>> different columns (although it might perhaps allow different compression
>> level, to some extent).
>>
>> Conclusion: If we want to offer a simple cluster-wide pglz alternative,
>> perhaps this patch is not the right way to do that.
>
> I actually disagree with your conclusion here. I mean, if you do it
> that way, then it has the same problem as checksums: changing
> compression algorithms requires a full dump-and-reload of the
> database, which makes it more or less a non-starter for large
> databases. On the other hand, with the infrastructure provided by
> this patch, we can have a default_compression_method GUC that will be
> set to 'pglz' initially. If the user changes it to 'lz4', or we ship
> a new release where the new default is 'lz4', then new tables created
> will use that new setting, but the existing stuff keeps working. If
> you want to upgrade your existing tables to use lz4 rather than pglz,
> you can change the compression option for those columns to COMPRESS
> lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS
> lz4 to force a rewrite of an individual table. That's really
> powerful, and I think users will like it a lot.
>
> In short, your approach, while perhaps a little simpler to code, seems
> like it is fraught with operational problems which this design avoids.
>
I agree the checksum-like limitations are annoying and make it
impossible to change the compression algorithm after the cluster is
initialized (although I recall a discussion about addressing that).
So yeah, if such flexibility is considered valuable/important, then the
patch is a better solution.
>> Custom datatype-aware compression (e.g. the tsvector)
>> ----------------------------------------------------------------------
>>
>> Exploiting knowledge of the internal data type structure is a promising
>> way to improve compression ratio and/or performance.
>>
>> The obvious question of course is why shouldn't this be done by the data
>> type code directly, which would also allow additional benefits like
>> operating directly on the compressed values.
>>
>> Another thing is that if the datatype representation changes in some
>> way, the compression method has to change too. So it's tightly coupled
>> to the datatype anyway.
>>
>> This does not really require any new infrastructure, all the pieces are
>> already there.
>>
>> In some cases that may not be quite possible - the datatype may not be
>> flexible enough to support alternative (compressed) representation, e.g.
>> because there are no bits available for "compressed" flag, etc.
>>
>> Conclusion: IMHO if we want to exploit the knowledge of the data type
>> internal structure, perhaps doing that in the datatype code directly
>> would be a better choice.
>
> I definitely think there's a place for compression built right into
> the data type. I'm still happy about commit
> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
> needs to be done there. But that type of improvement and what is
> proposed here are basically orthogonal. Having either one is good;
> having both is better.
>
Why orthogonal?
For example, why couldn't (or shouldn't) the tsvector compression be
done by tsvector code itself? Why should we be doing that at the varlena
level (so that the tsvector code does not even know about it)?
For example, we could make the datatype EXTERNAL and do the compression
on our own, using a custom algorithm. Of course, that would require a
datatype-specific implementation, but tsvector_compress does that too.
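For reference, the existing knob that gets us part of the way there
(standard syntax; the table and column names are just placeholders):

-- store the value out of line but never let TOAST compress it;
-- the datatype code is then free to compress internally
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;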
It seems to me the main reason is that tsvector actually does not allow
us to do that, as there's no good way to distinguish the different
internal formats (e.g. by storing a flag or format version in some sort
of header, etc.).
> I think there may also be a place for declaring that a particular data
> type has a "privileged" type of TOAST compression; if you use that
> kind of compression for that data type, the data type will do smart
> things, and if not, it will have to decompress in more cases. But I
> think this infrastructure makes that kind of thing easier, not harder.
>
I don't quite understand how that would be done. Isn't TOAST meant to be
entirely transparent for the datatypes? I can imagine custom TOAST
compression (which is pretty much what the patch does, after all), but I
don't see how the datatype could do anything smart about it, because it
has no idea which particular compression was used. And considering the
OIDs of the compression methods do change, I'm not sure that's fixable.
>> Custom datatype-aware compression with additional column-specific
>> metadata (e.g. the jsonb with external dictionary).
>> ----------------------------------------------------------------------
>>
>> Exploiting redundancy in multiple values in the same column (instead of
>> compressing them independently) is another attractive way to help the
>> compression. It is inherently datatype-aware, but currently can't be
>> implemented directly in datatype code as there's no concept of
>> column-specific storage (e.g. to store dictionary shared by all values
>> in a particular column).
>>
>> I believe any patch addressing this use case would have to introduce
>> such column-specific storage, and any solution doing that would probably
>> need to introduce the same catalogs, etc.
>>
>> The obvious disadvantage of course is that we need to decompress the
>> varlena value before doing pretty much anything with it, because the
>> datatype is not aware of the compression.
>>
>> So I wonder if the patch should instead provide infrastructure for doing
>> that in the datatype code directly.
>>
>> The other question is if the patch should introduce some infrastructure
>> for handling the column context (e.g. column dictionary). Right now,
>> whoever implements the compression has to implement this bit too.
>
> I agree that having a place to store a per-column compression
> dictionary would be awesome, but I think that could be added later on
> top of this infrastructure. For example, suppose we stored each
> per-column compression dictionary in a separate file and provided some
> infrastructure for WAL-logging changes to the file on a logical basis
> and checkpointing those updates. Then we wouldn't be tied to the
> MVCC/transactional issues which storing the blobs in a table would
> have, which seems like a big win. Of course, it also creates a lot of
> little tiny files inside a directory that already tends to have too
> many files, but maybe with some more work we can figure out a way
> around that problem. Here again, it seems to me that the proposed
> design is going more in the right direction than the wrong direction:
> if some day we have per-column dictionaries, they will need to be tied
> to specific compression methods on specific columns. If we already
> have that concept, extending it to do something new is easier than if
> we have to create it from scratch.
>
Well, it wasn't my goal to suddenly widen the scope of the patch and
require that it add all these pieces. My intent was more to point out
pieces that will need to be filled in later.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Chapman Flack <chap(at)anastigmatix(dot)net> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-12 22:09:58 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 12/12/2017 04:33 PM, Robert Haas wrote:
> you want to upgrade your existing tables to use lz4 rather than pglz,
> you can change the compression option for those columns to COMPRESS
> lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS
This is a thread I've only been following peripherally, so forgive
a question that's probably covered somewhere upthread: how will this
be done? Surely not with compression-type bits in each tuple? By
remembering a txid where the compression was changed, and the former
algorithm for older txids?
-Chap
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-13 00:54:52 |
Message-ID: | CA+TgmoY0OnO8cfKEarPb6AXHGTBh9EeW93cemXhmWfcH0xg+xw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Dec 12, 2017 at 5:07 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I definitely think there's a place for compression built right into
>> the data type. I'm still happy about commit
>> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
>> needs to be done there. But that type of improvement and what is
>> proposed here are basically orthogonal. Having either one is good;
>> having both is better.
>>
> Why orthogonal?
I mean, they are different things. Data types are already free to
invent more compact representations, and that does not preclude
applying pglz to the result.
> For example, why couldn't (or shouldn't) the tsvector compression be
> done by tsvector code itself? Why should we be doing that at the varlena
> level (so that the tsvector code does not even know about it)?
We could do that, but then:
1. The compression algorithm would be hard-coded into the system
rather than changeable. Pluggability has some value.
2. If several data types can benefit from a similar approach, it has
to be separately implemented for each one.
3. Compression is only applied to large-ish values. If you are just
making the data type representation more compact, you probably want to
apply the new representation to all values. If you are compressing in
the sense that the original data gets smaller but harder to interpret,
then you probably only want to apply the technique where the value is
already pretty wide, and maybe respect the user's configured storage
attributes. TOAST knows about some of that kind of stuff.
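For the record, the storage attributes referred to here are the existing
per-column SET STORAGE settings (t and a being placeholder names):

ALTER TABLE t ALTER COLUMN a SET STORAGE MAIN;     -- compress, keep inline
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL; -- out of line, never compress
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTENDED; -- compress, then out of line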
> It seems to me the main reason is that tsvector actually does not allow
> us to do that, as there's no good way to distinguish the different
> internal format (e.g. by storing a flag or format version in some sort
> of header, etc.).
That is also a potential problem, although I suspect it is possible to
work around it somehow for most data types. It might be annoying,
though.
>> I think there may also be a place for declaring that a particular data
>> type has a "privileged" type of TOAST compression; if you use that
>> kind of compression for that data type, the data type will do smart
>> things, and if not, it will have to decompress in more cases. But I
>> think this infrastructure makes that kind of thing easier, not harder.
>
> I don't quite understand how that would be done. Isn't TOAST meant to be
> entirely transparent for the datatypes? I can imagine custom TOAST
> compression (which is pretty much what the patch does, after all), but I
> don't see how the datatype could do anything smart about it, because it
> has no idea which particular compression was used. And considering the
> OIDs of the compression methods do change, I'm not sure that's fixable.
I don't think TOAST needs to be entirely transparent for the
datatypes. We've already dipped our toe in the water by allowing some
operations on "short" varlenas, and there's really nothing to prevent
a given datatype from going further. The OID problem you mentioned
would presumably be solved by hard-coding the OIDs for any built-in,
privileged compression methods.
> Well, it wasn't my goal to suddenly widen the scope of the patch and
> require that it add all these pieces. My intent was more to point out
> pieces that will need to be filled in later.
Sure, that's fine. I'm not worked up about this, just explaining why
it seems reasonably well-designed to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-13 10:10:46 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 12/13/2017 01:54 AM, Robert Haas wrote:
> On Tue, Dec 12, 2017 at 5:07 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> I definitely think there's a place for compression built right into
>>> the data type. I'm still happy about commit
>>> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
>>> needs to be done there. But that type of improvement and what is
>>> proposed here are basically orthogonal. Having either one is good;
>>> having both is better.
>>>
>> Why orthogonal?
>
> I mean, they are different things. Data types are already free to
> invent more compact representations, and that does not preclude
> applying pglz to the result.
>
>> For example, why couldn't (or shouldn't) the tsvector compression be
>> done by tsvector code itself? Why should we be doing that at the varlena
>> level (so that the tsvector code does not even know about it)?
>
> We could do that, but then:
>
> 1. The compression algorithm would be hard-coded into the system
> rather than changeable. Pluggability has some value.
>
Sure. I agree extensibility of pretty much all parts is a significant
asset of the project.
> 2. If several data types can benefit from a similar approach, it has
> to be separately implemented for each one.
>
I don't think the current solution improves that, though. If you want to
exploit internal features of individual data types, it pretty much
requires code customized to every such data type.
For example you can't take the tsvector compression and just slap it on
tsquery, because it relies on knowledge of internal tsvector structure.
So you need separate implementations anyway.
> 3. Compression is only applied to large-ish values. If you are just
> making the data type representation more compact, you probably want to
> apply the new representation to all values. If you are compressing in
> the sense that the original data gets smaller but harder to interpret,
> then you probably only want to apply the technique where the value is
> already pretty wide, and maybe respect the user's configured storage
> attributes. TOAST knows about some of that kind of stuff.
>
Good point. One such parameter that I really miss is compression level.
I can imagine tuning it through CREATE COMPRESSION METHOD, but it does
not seem quite possible with compression happening in a datatype.
>> It seems to me the main reason is that tsvector actually does not allow
>> us to do that, as there's no good way to distinguish the different
>> internal format (e.g. by storing a flag or format version in some sort
>> of header, etc.).
>
> That is also a potential problem, although I suspect it is possible to
> work around it somehow for most data types. It might be annoying,
> though.
>
>>> I think there may also be a place for declaring that a particular data
>>> type has a "privileged" type of TOAST compression; if you use that
>>> kind of compression for that data type, the data type will do smart
>>> things, and if not, it will have to decompress in more cases. But I
>>> think this infrastructure makes that kind of thing easier, not harder.
>>
>> I don't quite understand how that would be done. Isn't TOAST meant to be
>> entirely transparent for the datatypes? I can imagine custom TOAST
>> compression (which is pretty much what the patch does, after all), but I
>> don't see how the datatype could do anything smart about it, because it
>> has no idea which particular compression was used. And considering the
>> OIDs of the compression methods do change, I'm not sure that's fixable.
>
> I don't think TOAST needs to be entirely transparent for the
> datatypes. We've already dipped our toe in the water by allowing some
> operations on "short" varlenas, and there's really nothing to prevent
> a given datatype from going further. The OID problem you mentioned
> would presumably be solved by hard-coding the OIDs for any built-in,
> privileged compression methods.
>
Stupid question, but what do you mean by "short" varlenas?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-13 12:18:18 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, 12 Dec 2017 15:52:01 -0500
Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Yes. I wonder if \d or \d+ can show it somehow.
>
Yes, in the current version of the patch, \d+ shows the current
compression. It could be extended to show the list of preserved
compression methods.
Since we agreed on the ALTER syntax, I want to clarify things about
CREATE. Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or CREATE
COMPRESSION METHOD? I like the access method approach, and it
simplifies the code, but I'm just not sure whether compression is an
access method or not.
Current implementation
----------------------
To avoid extra patches I also want to clarify things about the current
implementation. Right now there are two tables, "pg_compression" and
"pg_compression_opt". When a compression method is linked to a column, a
record is created in pg_compression_opt. This record's Oid is stored in
the varlena. These Oids are kept in the first column so I can move them
in pg_upgrade, but in all other aspects they behave like usual Oids.
They are also easy to restore.
Compression options are linked to a specific column. When a tuple is
moved between relations it will be decompressed.
Also, in the current implementation, SET COMPRESSION includes a WITH
clause which is used to provide extra options to the compression method.
What could be changed
---------------------
As Alvaro mentioned, COMPRESSION METHOD is practically an access method,
so it could be created as CREATE ACCESS METHOD .. TYPE COMPRESSION.
This approach simplifies the patch, and the "pg_compression" table could
be removed. A compression method would then be created with something
like:
CREATE ACCESS METHOD .. TYPE COMPRESSION HANDLER
awesome_compression_handler;
The syntax of SET COMPRESSION changes to SET COMPRESSION .. PRESERVE,
which is useful for controlling rewrites and lets pg_upgrade create
dependencies between moved compression options and the compression
methods in the pg_am table.
Default compression is always pglz, and if users want to change it they
run:
ALTER COLUMN <col> SET COMPRESSION awesome PRESERVE pglz;
Without PRESERVE it will rewrite the whole relation using the new
compression. The rewrite also removes all unlisted compression options,
so their compression methods can then be safely dropped.
The "pg_compression_opt" table could be renamed to "pg_compression", and
compression options would be stored there.
I'd like to keep extra compression options; for example, pglz can be
configured with them. The syntax would change slightly:
SET COMPRESSION pglz WITH (min_comp_rate=25) PRESERVE awesome;
Setting the same compression method with different options will create a
new compression options record for future tuples, but will not rewrite
the table.
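Putting the proposed pieces together, the lifecycle would look roughly
like this (the "awesome" method and its handler are made-up names):

-- define a compression method as an access method
CREATE ACCESS METHOD awesome TYPE COMPRESSION
    HANDLER awesome_compression_handler;
-- switch a column incrementally; old pglz datums stay readable
ALTER TABLE t ALTER COLUMN a SET COMPRESSION awesome PRESERVE pglz;
-- later: rewrite everything with configured pglz, dropping awesome datums
ALTER TABLE t ALTER COLUMN a SET COMPRESSION pglz WITH (min_comp_rate=25);
-- the rewrite removed all awesome-compressed values, so this is now safe
DROP ACCESS METHOD awesome;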
--
----
Regards,
Ildus Kurbangaliev
From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-13 16:55:01 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tomas Vondra wrote:
> On 12/13/2017 01:54 AM, Robert Haas wrote:
> > 3. Compression is only applied to large-ish values. If you are just
> > making the data type representation more compact, you probably want to
> > apply the new representation to all values. If you are compressing in
> > the sense that the original data gets smaller but harder to interpret,
> > then you probably only want to apply the technique where the value is
> > already pretty wide, and maybe respect the user's configured storage
> > attributes. TOAST knows about some of that kind of stuff.
>
> Good point. One such parameter that I really miss is compression level.
> I can imagine tuning it through CREATE COMPRESSION METHOD, but it does
> not seem quite possible with compression happening in a datatype.
Hmm, actually isn't that the sort of thing that you would tweak using a
column-level option instead of a compression method?
ALTER TABLE ALTER COLUMN SET (compression_level=123)
The only thing we need for this is to make tuptoaster.c aware of the
need to check for a parameter.
> > I don't think TOAST needs to be entirely transparent for the
> > datatypes. We've already dipped our toe in the water by allowing some
> > operations on "short" varlenas, and there's really nothing to prevent
> > a given datatype from going further. The OID problem you mentioned
> > would presumably be solved by hard-coding the OIDs for any built-in,
> > privileged compression methods.
>
> Stupid question, but what do you mean by "short" varlenas?
Those are varlenas with 1-byte header rather than the standard 4-byte
header.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-13 18:34:49 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 12/13/2017 05:55 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> On 12/13/2017 01:54 AM, Robert Haas wrote:
>
>>> 3. Compression is only applied to large-ish values. If you are just
>>> making the data type representation more compact, you probably want to
>>> apply the new representation to all values. If you are compressing in
>>> the sense that the original data gets smaller but harder to interpret,
>>> then you probably only want to apply the technique where the value is
>>> already pretty wide, and maybe respect the user's configured storage
>>> attributes. TOAST knows about some of that kind of stuff.
>>
>> Good point. One such parameter that I really miss is compression level.
>> I can imagine tuning it through CREATE COMPRESSION METHOD, but it does
>> not seem quite possible with compression happening in a datatype.
>
> Hmm, actually isn't that the sort of thing that you would tweak using a
> column-level option instead of a compression method?
> ALTER TABLE ALTER COLUMN SET (compression_level=123)
> The only thing we need for this is to make tuptoaster.c aware of the
> need to check for a parameter.
>
Wouldn't that require some universal compression level, shared by all
supported compression algorithms? I don't think there is such a thing.
Defining one should not be extremely difficult, although I'm sure there
will be some cumbersome cases. For example, what if algorithm "a"
supports compression levels 0-10, and algorithm "b" only supports 0-3?
You may define 11 "universal" compression levels and map the four levels
of "b" onto them (somehow). But then everyone has to understand how that
"universal" mapping is defined.
Another issue is that there are algorithms without a compression level
(e.g. pglz does not have one, AFAICS), or with a somewhat different
definition (lz4 does not have levels, and instead has "acceleration",
which may be an arbitrary positive integer, so not really compatible
with a "universal" compression level).
So to me
ALTER TABLE ALTER COLUMN SET (compression_level=123)
seems more like an unnecessary hurdle ...
>>> I don't think TOAST needs to be entirely transparent for the
>>> datatypes. We've already dipped our toe in the water by allowing some
>>> operations on "short" varlenas, and there's really nothing to prevent
>>> a given datatype from going further. The OID problem you mentioned
>>> would presumably be solved by hard-coding the OIDs for any built-in,
>>> privileged compression methods.
>>
>> Stupid question, but what do you mean by "short" varlenas?
>
> Those are varlenas with 1-byte header rather than the standard 4-byte
> header.
>
OK, that's what I thought. But that is still pretty transparent to the
data types, no?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-14 15:21:44 |
Message-ID: | CA+TgmoYszJaQivv1eG4JAV3S19t61Y68KLJPNiRUABNWurQANA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Dec 13, 2017 at 5:10 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> 2. If several data types can benefit from a similar approach, it has
>> to be separately implemented for each one.
>
> I don't think the current solution improves that, though. If you want to
> exploit internal features of individual data types, it pretty much
> requires code customized to every such data type.
>
> For example you can't take the tsvector compression and just slap it on
> tsquery, because it relies on knowledge of internal tsvector structure.
> So you need separate implementations anyway.
I don't think that's necessarily true. Certainly, it's true that *if*
tsvector compression depends on knowledge of internal tsvector
structure, *then* you can't use the implementation for anything
else (this, by the way, means that there needs to be some way for a
compression method to reject being applied to a column of a data type
it doesn't like). However, it seems possible to imagine compression
algorithms that can work for a variety of data types, too. There
might be a compression algorithm that is theoretically a
general-purpose algorithm but has features which are particularly
well-suited to, say, JSON or XML data, because it looks for word
boundaries to decide on what strings to insert into the compression
dictionary.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-14 15:25:34 |
Message-ID: | CA+TgmoYzht_2g8bqwXGB24c24wjvJuOBTkgVhXKhi83OS18GQQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Dec 13, 2017 at 1:34 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Wouldn't that require some universal compression level, shared by all
> supported compression algorithms? I don't think there is such thing.
>
> Defining it should not be extremely difficult, although I'm sure there
> will be some cumbersome cases. For example what if an algorithm "a"
> supports compression levels 0-10, and algorithm "b" only supports 0-3?
>
> You may define 11 "universal" compression levels, and map the four
> levels for "b" to that (how). But then everyone has to understand how
> that "universal" mapping is defined.
What we could do is use the "namespace" feature of reloptions to
distinguish options for the column itself from options for the
compression algorithm. Currently namespaces are used only to allow
you to configure toast.whatever = somevalue, but we could let you say
pglz.something = somevalue or lz4.whatever = somevalue. Or maybe, to
avoid confusion -- what happens if somebody invents a compression
method called toast? -- we should do it as compress.lz4.whatever =
somevalue. I think this takes us a bit far afield from the core
purpose of this patch and should be a separate patch at a later time,
but I think it would be cool.
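Purely illustrative (no such namespace exists today, and the option name
is invented):

ALTER TABLE t ALTER COLUMN a SET (compress.lz4.acceleration = 4);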
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-14 15:29:10 |
Message-ID: | CA+TgmoZ68LyL0rFPDgC1J-6p9ZQZc2xS4Txd-9661s+w6_MwBg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Dec 13, 2017 at 7:18 AM, Ildus Kurbangaliev
<i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> Since we agreed on the ALTER syntax, I want to clarify things about
> CREATE. Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or CREATE
> COMPRESSION METHOD? I like the access method approach, and it
> simplifies the code, but I'm just not sure whether compression is an
> access method or not.
+1 for ACCESS METHOD.
> Current implementation
> ----------------------
>
> To avoid extra patches I also want to clarify things about the current
> implementation. Right now there are two tables, "pg_compression" and
> "pg_compression_opt". When a compression method is linked to a column, a
> record is created in pg_compression_opt. This record's Oid is stored in
> the varlena. These Oids are kept in the first column so I can move them
> in pg_upgrade, but in all other aspects they behave like usual Oids.
> They are also easy to restore.
pg_compression_opt -> pg_attr_compression, maybe.
> Compression options are linked to a specific column. When a tuple is
> moved between relations it will be decompressed.
Can we do this only if the compression method isn't OK for the new
column? For example, if the old column is COMPRESS foo PRESERVE bar
and the new column is COMPRESS bar PRESERVE foo, we don't need to
force decompression in any case.
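In other words (spelled here with the SET COMPRESSION form; foo and bar
are hypothetical methods):

ALTER TABLE src ALTER COLUMN a SET COMPRESSION foo PRESERVE bar;
ALTER TABLE dst ALTER COLUMN a SET COMPRESSION bar PRESERVE foo;
-- both columns accept foo- and bar-compressed datums, so moving rows
-- from src to dst should not force decompression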
> Also, in the current implementation, SET COMPRESSION includes a WITH
> clause which is used to provide extra options to the compression method.
Hmm, that's an alternative to using reloptions. Maybe that's fine.
> What could be changed
> ---------------------
>
> As Alvaro mentioned COMPRESSION METHOD is practically an access method,
> so it could be created as CREATE ACCESS METHOD .. TYPE COMPRESSION.
> This approach simplifies the patch and "pg_compression" table could be
> removed. So compression method is created with something like:
>
> CREATE ACCESS METHOD .. TYPE COMPRESSION HANDLER
> awesome_compression_handler;
>
> Syntax of SET COMPRESSION changes to SET COMPRESSION .. PRESERVE which
> is useful to control rewrites and for pg_upgrade to make dependencies
> between moved compression options and compression methods from pg_am
> table.
>
> Default compression is always pglz and if users want to change they run:
>
> ALTER COLUMN <col> SET COMPRESSION awesome PRESERVE pglz;
>
> Without PRESERVE it will rewrite the whole relation using the new
> compression. The rewrite also removes all unlisted compression options,
> so their compression methods can then be safely dropped.
That all sounds good.
> "pg_compression_opt" table could be renamed to "pg_compression", and
> compression options will be stored there.
See notes above.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-14 17:23:30 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 12/14/2017 04:21 PM, Robert Haas wrote:
> On Wed, Dec 13, 2017 at 5:10 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> 2. If several data types can benefit from a similar approach, it has
>>> to be separately implemented for each one.
>>
>> I don't think the current solution improves that, though. If you
>> want to exploit internal features of individual data types, it
>> pretty much requires code customized to every such data type.
>>
>> For example you can't take the tsvector compression and just slap
>> it on tsquery, because it relies on knowledge of internal tsvector
>> structure. So you need separate implementations anyway.
>
> I don't think that's necessarily true. Certainly, it's true that
> *if* tsvector compression depends on knowledge of internal tsvector
> structure, *then* you can't use the implementation for anything
> else (this, by the way, means that there needs to be some way for a
> compression method to reject being applied to a column of a data
> type it doesn't like).
I believe such a dependency (on implementation details) is pretty much
the main benefit of datatype-aware compression methods. If you don't
rely on such an assumption, then I'd say it's a general-purpose
compression method.
> However, it seems possible to imagine compression algorithms that can
> work for a variety of data types, too. There might be a compression
> algorithm that is theoretically a general-purpose algorithm but has
> features which are particularly well-suited to, say, JSON or XML
> data, because it looks for word boundaries to decide on what strings
> to insert into the compression dictionary.
>
Can you give an example of such an algorithm? Because I haven't seen one,
and I find arguments based on hypothetical compression methods somewhat
suspicious.
FWIW I'm not against considering such compression methods, but OTOH it
may not be such a great primary use case to drive the overall design.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-17 03:32:09 |
Message-ID: | CA+Tgmob-O-J2xFWPv7Z8hhCJj2JGet=WEPrUqi8aHwCq2OM8Bg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Dec 14, 2017 at 12:23 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Can you give an example of such an algorithm? Because I haven't seen one,
> and I find arguments based on hypothetical compression methods somewhat
> suspicious.
>
> FWIW I'm not against considering such compression methods, but OTOH it
> may not be such a great primary use case to drive the overall design.
Well it isn't, really. I am honestly not sure what we're arguing
about at this point. I think you've agreed that (1) opening avenues
for extensibility is useful, (2) substituting a general-purpose
compression algorithm could be useful, and (3) having datatype
compression that is enabled through TOAST rather than built into the
datatype might sometimes be desirable. That's more than adequate
justification for this proposal, whether half-general compression
methods exist or not. I am prepared to concede that there may be no
useful examples of such a thing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-18 08:54:31 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, 14 Dec 2017 10:29:10 -0500
Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Dec 13, 2017 at 7:18 AM, Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> > Since we agreed on the ALTER syntax, I want to clarify things about
> > CREATE. Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or
> > CREATE COMPRESSION METHOD? I like the access method approach, and it
> > simplifies the code, but I'm just not sure whether compression is an
> > access method or not.
>
> +1 for ACCESS METHOD.
An access method then.
>
> > Current implementation
> > ----------------------
> >
> > To avoid extra patches I also want to clarify things about the
> > current implementation. Right now there are two tables,
> > "pg_compression" and "pg_compression_opt". When a compression method
> > is linked to a column, a record is created in pg_compression_opt.
> > This record's Oid is stored in the varlena. These Oids are kept in
> > the first column so I can move them in pg_upgrade, but in all other
> > aspects they behave like usual Oids. They are also easy to restore.
>
> pg_compression_opt -> pg_attr_compression, maybe.
>
> > Compression options are linked to a specific column. When a tuple is
> > moved between relations it will be decompressed.
>
> Can we do this only if the compression method isn't OK for the new
> column? For example, if the old column is COMPRESS foo PRESERVE bar
> and the new column is COMPRESS bar PRESERVE foo, we don't need to
> force decompression in any case.
Thanks, that sounds right; I will add it to the patch.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-18 15:43:38 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 12/17/2017 04:32 AM, Robert Haas wrote:
> On Thu, Dec 14, 2017 at 12:23 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Can you give an example of such an algorithm? Because I haven't seen one,
>> and I find arguments based on hypothetical compression methods somewhat
>> suspicious.
>>
>> FWIW I'm not against considering such compression methods, but OTOH it
>> may not be such a great primary use case to drive the overall design.
>
> Well it isn't, really. I am honestly not sure what we're arguing
> about at this point. I think you've agreed that (1) opening avenues
> for extensibility is useful, (2) substituting a general-purpose
> compression algorithm could be useful, and (3) having datatype
> compression that is enabled through TOAST rather than built into the
> datatype might sometimes be desirable. That's more than adequate
> justification for this proposal, whether half-general compression
> methods exist or not. I am prepared to concede that there may be no
> useful examples of such a thing.
>
I don't think we're arguing - we're discussing whether a proposed patch
is the right design for solving the relevant use cases.
I personally am not quite convinced about that, for the reason I tried
to explain in my previous messages. I see it as a poor alternative to
compression built into the data type. I do like the idea of compression
with external dictionary, however.
But don't forget that it's not me in this thread - it's my evil twin,
moonlighting as Mr. Devil's lawyer ;-)
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2017-12-18 16:12:34 |
Message-ID: | CA+Tgmoau_qeawSQpod8RLnHwS0uvLkhqBQZNfL4UWpWWHN0NaQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 18, 2017 at 10:43 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I personally am not quite convinced about that, for the reasons I tried
> to explain in my previous messages. I see it as a poor alternative to
> compression built into the data type. I do like the idea of compression
> with an external dictionary, however.
I think that compression built into the datatype and what is proposed
here are both useful, and everybody's free to work on either one as
they prefer, so I don't see that as a reason not to accept this patch. And
I think this patch can be a stepping stone toward compression with an
external dictionary, so that seems like an affirmative reason to
accept this patch.
> But don't forget that it's not me in this thread - it's my evil twin,
> moonlighting as Mr. Devil's lawyer ;-)
Well, I don't mind you objecting to the patch under any persona, but
so far I'm not finding your reasons convincing...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-14 21:49:30 |
Message-ID: | 20180115024930.48583c69@hh |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Attached a new version of the patch. Main changes:
* compression as an access method
* pglz as the default compression access method
* PRESERVE syntax for table rewrite control (sketched below)
* pg_upgrade fixes
* support for partitioned tables
* more tests
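For reference, the PRESERVE syntax looks roughly like this (a sketch
with a placeholder method name cm1; see the patch for the exact
grammar):

ALTER TABLE t ALTER COLUMN a SET COMPRESSION cm1 PRESERVE (pglz);
-- old pglz-compressed datums are kept as they are, so no table rewrite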
Regards,
Ildus Kurbangaliev
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v8.patch | text/x-patch | 319.3 KB |
From: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-22 20:26:31 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello Ildus,
On 15/01/2018 00:49, Ildus Kurbangaliev wrote:
> Attached a new version of the patch. Main changes:
>
> * compression as an access method
> * pglz as the default compression access method
> * PRESERVE syntax for table rewrite control
> * pg_upgrade fixes
> * support for partitioned tables
> * more tests
>
You need to rebase to the latest master; there are some conflicts. I've
applied it to a three-day-old master to try it.
As far as I can see, the documentation is not yet complete. For example,
there is no section for ALTER COLUMN ... SET COMPRESSION in ddl.sgml; and
the section "Compression Access Method Functions" in compression-am.sgml
hasn't been finished.
I've implemented an extension [1] to understand the way a developer
would work with the new infrastructure, and it seems clear to me.
(Except that it took me some effort to wrap my mind around the varlena
macros, but that is probably a different topic.)
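For reference, using the extension looks about like this (assuming it
registers a compression access method named pg_lz4 on CREATE EXTENSION):

create extension pg_lz4;
create table t(msg text compression pg_lz4);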
I noticed that you haven't covered 'cmdrop' in the regression tests, and
I saw the previous discussion about it. Have you considered using event
triggers to handle the drop of column compression instead of the
'cmdrop' function? This way you would kill two birds with one stone: it
still provides sufficient infrastructure to catch those events (and it's
something Postgres already has for different kinds of DDL commands), and
it would be easier to test.
Thanks!
[1] https://github.com/zilder/pg_lz4
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-23 13:04:54 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 22 Jan 2018 23:26:31 +0300
Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> wrote:
Thanks for the review! Attached a new version of the patch. Fixed a few
bugs, added more documentation and rebased to current master.
> You need to rebase to the latest master; there are some conflicts.
> I've applied it to a three-day-old master to try it.
Done.
>
> As far as I can see, the documentation is not yet complete. For
> example, there is no section for ALTER COLUMN ... SET COMPRESSION in
> ddl.sgml; and the section "Compression Access Method Functions" in
> compression-am.sgml hasn't been finished.
Not sure about ddl.sgml, it covers more general things, but since
Postgres contains only pglz by default there is not much to show.
>
> I've implemented an extension [1] to understand the way a developer
> would work with the new infrastructure, and it seems clear to me.
> (Except that it took me some effort to wrap my mind around the
> varlena macros, but that is probably a different topic.)
>
> I noticed that you haven't covered 'cmdrop' in the regression tests,
> and I saw the previous discussion about it. Have you considered using
> event triggers to handle the drop of column compression instead of
> the 'cmdrop' function? This way you would kill two birds with one
> stone: it still provides sufficient infrastructure to catch those
> events (and it's something Postgres already has for different kinds
> of DDL commands), and it would be easier to test.
I have added support for event triggers for ALTER SET COMPRESSION in
the current version. An event trigger on ALTER could be used to replace
the cmdrop function, but it would be far from trivial: there is no easy
way to determine from the command that an attribute's compression is
really being dropped.
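For example, a generic event trigger like the sketch below fires for
the whole ALTER TABLE command; nothing it sees says that a column's
compression is being dropped:

CREATE FUNCTION report_alter() RETURNS event_trigger AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        -- command_tag is just 'ALTER TABLE' here; the subcommand that
        -- drops the attribute compression is not exposed
        RAISE NOTICE 'tag: %, object: %', r.command_tag, r.object_identity;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER report_alter_trg ON ddl_command_end
    WHEN TAG IN ('ALTER TABLE')
    EXECUTE PROCEDURE report_alter();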
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v9.patch | text/x-patch | 322.3 KB |
From: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-25 13:03:20 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello Ildus,
On 23.01.2018 16:04, Ildus Kurbangaliev wrote:
> On Mon, 22 Jan 2018 23:26:31 +0300
> Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> wrote:
>
> Thanks for the review! Attached a new version of the patch. Fixed a
> few bugs, added more documentation and rebased to current master.
>
>> You need to rebase to the latest master; there are some conflicts.
>> I've applied it to a three-day-old master to try it.
>
> Done.
>
>>
>> As far as I can see, the documentation is not yet complete. For
>> example, there is no section for ALTER COLUMN ... SET COMPRESSION in
>> ddl.sgml; and the section "Compression Access Method Functions" in
>> compression-am.sgml hasn't been finished.
>
> Not sure about ddl.sgml, it covers more general things, but since
> Postgres contains only pglz by default there is not much to show.
>
>>
>> I've implemented an extension [1] to understand the way a developer
>> would work with the new infrastructure, and it seems clear to me.
>> (Except that it took me some effort to wrap my mind around the
>> varlena macros, but that is probably a different topic.)
>>
>> I noticed that you haven't covered 'cmdrop' in the regression tests,
>> and I saw the previous discussion about it. Have you considered
>> using event triggers to handle the drop of column compression
>> instead of the 'cmdrop' function? This way you would kill two birds
>> with one stone: it still provides sufficient infrastructure to catch
>> those events (and it's something Postgres already has for different
>> kinds of DDL commands), and it would be easier to test.
>
> I have added support for event triggers for ALTER SET COMPRESSION in
> the current version. An event trigger on ALTER could be used to
> replace the cmdrop function, but it would be far from trivial: there
> is no easy way to determine from the command that an attribute's
> compression is really being dropped.
>
I've encountered unexpected behavior in the 'CREATE TABLE ... (LIKE
...)' command. It seems that it copies the compression settings of the
table attributes no matter which INCLUDING options are specified. E.g.
create table xxx(id serial, msg text compression pg_lz4);
alter table xxx alter column msg set storage external;
\d+ xxx
Table "public.xxx"
Column | Type | ... | Storage | Compression |
--------+---------+ ... +----------+-------------+
id | integer | ... | plain | |
msg | text | ... | external | pg_lz4 |
Now copy the table structure with "INCLUDING ALL":
create table yyy (like xxx including all);
\d+ yyy
Table "public.yyy"
Column | Type | ... | Storage | Compression |
--------+---------+ ... +----------+-------------+
id | integer | ... | plain | |
msg | text | ... | external | pg_lz4 |
And now copy without "INCLUDING ALL":
create table zzz (like xxx);
\d+ zzz
Table "public.zzz"
Column | Type | ... | Storage | Compression |
--------+---------+ ... +----------+-------------+
id | integer | ... | plain | |
msg | text | ... | extended | pg_lz4 |
As you can see, the compression option is copied anyway. I suggest
adding a new INCLUDING COMPRESSION option to enable the user to specify
explicitly whether or not they want to copy compression settings.
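Something like this (a sketch of the proposed option):

create table yyy (like xxx including compression);  -- copy compression
create table zzz (like xxx);                        -- don't copy it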
I found a few phrases in the documentation that can be improved. But
the documentation should be checked by a native speaker.
In compression-am.sgml:
"an compression access method" -> "a compression access method"
"compression method method" -> "compression method"
"compability" -> "compatibility"
Probably "local-backend cached state" would be better to replace with
"per backend cached state"?
"Useful to store the parsed view of the compression options" -> "It
could be useful for example to cache compression options"
"and stores result of" -> "and stores the result of"
"Called when CompressionAmOptions is creating." -> "Called when
<structname>CompressionAmOptions</structname> is being initialized"
"Note that in any system cache invalidation related with
pg_attr_compression relation the options will be cleaned" -> "Note that
any <literal>pg_attr_compression</literal> relation invalidation will
cause all the cached <literal>acstate</literal> options cleared."
"Function used to ..." -> "Function is used to ..."
I think it would be nice to mention custom compression methods in
storage.sgml. At the moment it only mentions the built-in pglz
compression.
--
Ildar Musin
i(dot)musin(at)postgrespro(dot)ru
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-25 14:24:57 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, 25 Jan 2018 16:03:20 +0300
Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> wrote:
Thanks for the review!
>
> As you can see, the compression option is copied anyway. I suggest
> adding a new INCLUDING COMPRESSION option to enable the user to
> specify explicitly whether or not they want to copy compression
> settings.
Good catch, I missed the INCLUDING options for the LIKE command. Added
INCLUDING COMPRESSION as you suggested.
>
>
> I found a few phrases in the documentation that can be improved. But
> the documentation should be checked by a native speaker.
>
> In compression-am.sgml:
> "an compression access method" -> "a compression access method"
> "compression method method" -> "compression method"
> "compability" -> "compatibility"
> Probably "local-backend cached state" would be better to replace with
> "per backend cached state"?
> "Useful to store the parsed view of the compression options" -> "It
> could be useful for example to cache compression options"
> "and stores result of" -> "and stores the result of"
> "Called when CompressionAmOptions is creating." -> "Called when
> <structname>CompressionAmOptions</structname> is being initialized"
>
> "Note that in any system cache invalidation related with
> pg_attr_compression relation the options will be cleaned" -> "Note
> that any <literal>pg_attr_compression</literal> relation invalidation
> will cause all the cached <literal>acstate</literal> options cleared."
> "Function used to ..." -> "Function is used to ..."
>
> I think it would be nice to mention custom compression methods in
> storage.sgml. At the moment it only mentions the built-in pglz
> compression.
>
I agree, the documentation needs review by a native speaker. Fixed the
lines you mentioned.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v10.patch | text/x-patch | 324.0 KB |
From: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-26 16:07:28 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello Ildus,
I continue reviewing your patch. Here are some thoughts.
1. When I set the column storage to EXTERNAL I cannot set compression.
Seems reasonable:
create table test(id serial, msg text);
alter table test alter column msg set storage external;
alter table test alter column msg set compression pg_lz4;
ERROR: storage for "msg" should be MAIN or EXTENDED
But if I reorder the commands then it's OK:
create table test(id serial, msg text);
alter table test alter column msg set compression pg_lz4;
alter table test alter column msg set storage external;
\d+ test
Table "public.test"
Column | Type | ... | Storage | Compression
--------+---------+ ... +----------+-------------
id | integer | ... | plain |
msg | text | ... | external | pg_lz4
So we could either allow the user to set compression even when storage
is EXTERNAL (perhaps with a warning), or prohibit setting compression
and external storage at the same time. The same applies to setting
storage PLAIN.
2. I think the TOAST_COMPRESS_SET_RAWSIZE macro could be rewritten as
follows to prevent overwriting the higher bits of 'info':

((toast_compress_header *) (ptr))->info = \
	(((toast_compress_header *) (ptr))->info & ~RAWSIZEMASK) | (len);

It may not matter at the moment since it is only used once, but it
could save some effort for other developers in the future.
In TOAST_COMPRESS_SET_CUSTOM() instead of changing individual bits you
may do something like this:

#define TOAST_COMPRESS_SET_CUSTOM(ptr) \
	do { \
		((toast_compress_header *) (ptr))->info = \
			(((toast_compress_header *) (ptr))->info & RAWSIZEMASK) | \
			((uint32) 0x02 << 30); \
	} while (0)
Also it would be nice if bit flags were explained and maybe replaced by
a macro.
3. In AlteredTableInfo, BulkInsertStateData and some functions (e.g.
toast_insert_or_update) there is a hash table used to keep the list of
preserved compression methods per attribute. I think a simple array of
List* would be sufficient in this case.
4. In optionListToArray() you can use list_qsort() to sort the options
list instead of converting it manually into an array and then back to
a list.
5. Redundant #includes:
In heap.c:
#include "access/reloptions.h"
In tsvector.c:
#include "catalog/pg_type.h"
#include "common/pg_lzcompress.h"
In relcache.c:
#include "utils/datum.h"
6. Just a minor thing: no reason to change formatting in copy.c
- heap_insert(resultRelInfo->ri_RelationDesc, tuple, mycid,
- hi_options, bistate);
+ heap_insert(resultRelInfo->ri_RelationDesc, tuple,
+ mycid, hi_options, bistate);
7. Also in utility.c an extra newline was added which isn't relevant
to this patch.
8. In parse_utilcmd.h the 'extern' keyword was removed from the
transformRuleStmt declaration, which doesn't make sense in this patch.
9. Comments. Again, they should be read by a native speaker. So just a
few suggestions:
toast_prepare_varlena() - comment needed
invalidate_amoptions_cache() - comment format doesn't match other
functions in the file
In htup_details.h:
/* tuple contain custom compressed
* varlenas */
should be "contains"
--
Ildar Musin
i(dot)musin(at)postgrespro(dot)ru
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-29 11:44:52 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, 26 Jan 2018 19:07:28 +0300
Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> wrote:
> Hello Ildus,
>
> I continue reviewing your patch. Here are some thoughts.
Thanks! Attached a new version of the patch.
>
> 1. When I set the column storage to EXTERNAL I cannot set
> compression. Seems reasonable:
> create table test(id serial, msg text);
> alter table test alter column msg set storage external;
> alter table test alter column msg set compression pg_lz4;
> ERROR: storage for "msg" should be MAIN or EXTENDED
Changed the behaviour; now it's OK to change storage in any direction
for toastable types. Also added protection for untoastable types.
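So now, for example, the sequence from your report works against the
patched build:

alter table test alter column msg set storage external;
alter table test alter column msg set compression pg_lz4;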
>
>
> 2. I think the TOAST_COMPRESS_SET_RAWSIZE macro could be rewritten
> as follows to prevent overwriting the higher bits of 'info':
>
> ((toast_compress_header *) (ptr))->info = \
> 	(((toast_compress_header *) (ptr))->info & ~RAWSIZEMASK) | (len);
>
> It may not matter at the moment since it is only used once, but it
> could save some effort for other developers in the future.
> In TOAST_COMPRESS_SET_CUSTOM() instead of changing individual bits
> you may do something like this:
>
> #define TOAST_COMPRESS_SET_CUSTOM(ptr) \
> 	do { \
> 		((toast_compress_header *) (ptr))->info = \
> 			(((toast_compress_header *) (ptr))->info & RAWSIZEMASK) | \
> 			((uint32) 0x02 << 30); \
> 	} while (0)
>
> Also it would be nice if bit flags were explained and maybe replaced
> by a macro.
I noticed that there is no need for TOAST_COMPRESS_SET_CUSTOM at all,
so I just removed it; TOAST_COMPRESS_SET_RAWSIZE will set the needed
flags.
>
>
> 3. In AlteredTableInfo, BulkInsertStateData and some functions (e.g.
> toast_insert_or_update) there is a hash table used to keep the list
> of preserved compression methods per attribute. I think a simple
> array of List* would be sufficient in this case.
Not sure about that; it would just complicate things without sufficient
improvement. It would also require passing the length of the array and
more memory for tables with a large number of attributes. But I've made
the default size of the hash table smaller, since it's unlikely that
the user will change compression of many attributes at once.
>
>
> 4. In optionListToArray() you can use list_qsort() to sort the
> options list instead of converting it manually into an array and
> then back to a list.
Good, didn't know about this function.
>
>
> 5. Redundant #includes:
>
> In heap.c:
> #include "access/reloptions.h"
> In tsvector.c:
> #include "catalog/pg_type.h"
> #include "common/pg_lzcompress.h"
> In relcache.c:
> #include "utils/datum.h"
>
> 6. Just a minor thing: no reason to change formatting in copy.c
> - heap_insert(resultRelInfo->ri_RelationDesc, tuple, mycid,
> - hi_options, bistate);
> + heap_insert(resultRelInfo->ri_RelationDesc, tuple,
> + mycid, hi_options, bistate);
>
> 7. Also in utility.c an extra newline was added which isn't
> relevant to this patch.
>
> 8. In parse_utilcmd.h the 'extern' keyword was removed from the
> transformRuleStmt declaration, which doesn't make sense in this patch.
>
> 9. Comments. Again, they should be read by a native speaker. So just
> a few suggestions:
> toast_prepare_varlena() - comment needed
> invalidate_amoptions_cache() - comment format doesn't match other
> functions in the file
>
> In htup_details.h:
> /* tuple contain custom compressed
> * varlenas */
> should be "contains"
>
5-9, all done. Thank you for noticing.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v11.patch | text/x-patch | 325.4 KB |
From: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-29 14:29:29 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello Ildus,
On 29.01.2018 14:44, Ildus Kurbangaliev wrote:
>
> Thanks! Attached a new version of the patch.
>
Patch applies cleanly, builds without any warnings, documentation builds
ok, all tests pass.
A remark for the committers. The patch is quite big, so I really wish
more reviewers looked into it for a more comprehensive review. Also a
native English speaker should check the documentation and comments.
Another thing is that the tests don't cover the cmdrop method because
the built-in pglz compression doesn't use it (I know there is a jsonbd
extension [1] based on this patch which should benefit from the cmdrop
method, but it doesn't test it either yet).
I think I did what I could, so I'm passing this patch to the committers
for review. Changed status to "Ready for committer".
[1] https://github.com/postgrespro/jsonbd
--
Ildar Musin
i(dot)musin(at)postgrespro(dot)ru
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-01-29 16:04:03 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 29 Jan 2018 17:29:29 +0300
Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> wrote:
>
> Patch applies cleanly, builds without any warnings, documentation
> builds ok, all tests pass.
>
> A remark for the committers. The patch is quite big, so I really wish
> more reviewers looked into it for a more comprehensive review. Also a
> native English speaker should check the documentation and comments.
> Another thing is that the tests don't cover the cmdrop method because
> the built-in pglz compression doesn't use it (I know there is a jsonbd
> extension [1] based on this patch which should benefit from the cmdrop
> method, but it doesn't test it either yet).
>
> I think I did what I could, so I'm passing this patch to the
> committers for review. Changed status to "Ready for committer".
>
>
> [1] https://github.com/postgrespro/jsonbd
>
Thank you!
About cmdrop, I checked that it is called, but I'm going to test it
thoroughly in my extension.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-02-06 10:47:31 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 29 Jan 2018 17:29:29 +0300
Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> wrote:
> Hello Ildus,
>
> On 29.01.2018 14:44, Ildus Kurbangaliev wrote:
> >
> > Thanks! Attached a new version of the patch.
> >
>
> Patch applies cleanly, builds without any warnings, documentation
> builds ok, all tests pass.
>
> A remark for the committers. The patch is quite big, so I really wish
> more reviewers looked into it for a more comprehensive review. Also a
> native English speaker should check the documentation and comments.
> Another thing is that the tests don't cover the cmdrop method because
> the built-in pglz compression doesn't use it (I know there is a jsonbd
> extension [1] based on this patch which should benefit from the cmdrop
> method, but it doesn't test it either yet).
>
> I think I did what I could, so I'm passing this patch to the
> committers for review. Changed status to "Ready for committer".
>
>
> [1] https://github.com/postgrespro/jsonbd
>
Attached a rebased version of the patch so it can be applied to current
master.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v12.patch | text/x-patch | 328.1 KB |
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-02-26 12:25:56 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Attached a new version of the patch, rebased to current master, with
conflicting catalog Oids fixed.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v13.patch | text/x-patch | 328.6 KB |
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-03-14 13:31:23 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 26 Feb 2018 15:25:56 +0300
Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> Hi,
> Attached a new version of the patch, rebased to current master, with
> conflicting catalog Oids fixed.
>
Attached a rebased version of the patch; fixed conflicts in pg_proc
and the TAP tests.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v14.patch | text/x-patch | 328.6 KB |
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-03-26 17:38:25 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Attached a rebased version of the patch. Fixed conflicts in pg_class.h.
--
----
Regards,
Ildus Kurbangaliev
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v14.patch | text/x-patch | 328.9 KB |
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-03-30 16:50:45 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 26 Mar 2018 20:38:25 +0300
Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> Attached a rebased version of the patch. Fixed conflicts in pg_class.h.
>
New rebased version due to conflicts in master. Also fixed a few errors
and removed the cmdrop method since it couldn't be tested.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v15.patch | text/x-patch | 329.5 KB |
From: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-20 16:45:06 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 30.03.2018 19:50, Ildus Kurbangaliev wrote:
> On Mon, 26 Mar 2018 20:38:25 +0300
> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
>> Attached a rebased version of the patch. Fixed conflicts in pg_class.h.
>>
> New rebased version due to conflicts in master. Also fixed a few
> errors and removed the cmdrop method since it couldn't be tested.
>
It seems to be useful (and not so difficult) to use custom compression
methods also for WAL compression: replace the direct calls of
pglz_compress in xloginsert.c.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-22 13:21:31 |
Message-ID: | CAPpHfdtmy_hN_-D9OJ-BwHb_PqUcKWq97kZPtOv6o1g065L8jw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Apr 20, 2018 at 7:45 PM, Konstantin Knizhnik
<k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> On 30.03.2018 19:50, Ildus Kurbangaliev wrote:
>> On Mon, 26 Mar 2018 20:38:25 +0300
>> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>>> Attached a rebased version of the patch. Fixed conflicts in
>>> pg_class.h.
>> New rebased version due to conflicts in master. Also fixed a few
>> errors and removed the cmdrop method since it couldn't be tested.
> It seems to be useful (and not so difficult) to use custom compression
> methods also for WAL compression: replace the direct calls of
> pglz_compress in xloginsert.c.
I'm going to object at this point, and I have the following arguments
for that:
1) WAL compression is much more critical for durability than datatype
compression. Imagine a compression algorithm contains a bug which
causes the decompress method to segfault. In the case of datatype
compression, that would cause a crash on access to some value; but the
rest of the database would keep working, giving you a chance to
localize and investigate the issue. In the case of WAL compression,
recovery would cause a server crash. That seems to be a much more
serious disaster: you wouldn't be able to get your database up and
running, and the same would happen on the standby.
2) The idea of custom compression methods is that some columns may
have a specific data distribution, which could be handled better with
a particular compression method and particular parameters. In WAL
compression you're dealing with the whole WAL stream containing all
the values from the database cluster. Moreover, if custom compression
methods are defined for columns, then in the WAL stream you have
values already compressed in the most efficient way. However, it might
appear that some compression method is better for WAL in the general
case (there are benchmarks showing our pglz is not very good in
comparison to the alternatives). But in that case I would prefer to
just switch our WAL to a different compression method one day.
Thankfully we don't preserve WAL compatibility between major releases.
3) This patch provides custom compression methods recorded in the
catalog. During recovery you don't have access to the system catalog,
because it's not recovered yet, and can't fetch compression method
metadata from there. The possible thing is to have a GUC which stores
the shared module and function names for WAL compression. But that
seems like quite a different mechanism from the one present in this
patch.
Taking into account all of the above, I think we should give up on
custom WAL compression methods, or at least consider them unrelated to
this patch.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-23 09:19:09 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, 22 Apr 2018 16:21:31 +0300
Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> On Fri, Apr 20, 2018 at 7:45 PM, Konstantin Knizhnik
> <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
> > On 30.03.2018 19:50, Ildus Kurbangaliev wrote:
> >> On Mon, 26 Mar 2018 20:38:25 +0300
> >> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> >>> Attached a rebased version of the patch. Fixed conflicts in
> >>> pg_class.h.
> >> New rebased version due to conflicts in master. Also fixed a few
> >> errors and removed the cmdrop method since it couldn't be tested.
> > It seems to be useful (and not so difficult) to use custom
> > compression methods also for WAL compression: replace the direct
> > calls of pglz_compress in xloginsert.c.
>
>
> I'm going to object at this point, and I have the following arguments
> for that:
>
> 1) WAL compression is much more critical for durability than datatype
> compression. Imagine a compression algorithm contains a bug which
> causes the decompress method to segfault. In the case of datatype
> compression, that would cause a crash on access to some value; but the
> rest of the database would keep working, giving you a chance to
> localize and investigate the issue. In the case of WAL compression,
> recovery would cause a server crash. That seems to be a much more
> serious disaster: you wouldn't be able to get your database up and
> running, and the same would happen on the standby.
>
> 2) The idea of custom compression methods is that some columns may
> have a specific data distribution, which could be handled better with
> a particular compression method and particular parameters. In WAL
> compression you're dealing with the whole WAL stream containing all
> the values from the database cluster. Moreover, if custom compression
> methods are defined for columns, then in the WAL stream you have
> values already compressed in the most efficient way. However, it might
> appear that some compression method is better for WAL in the general
> case (there are benchmarks showing our pglz is not very good in
> comparison to the alternatives). But in that case I would prefer to
> just switch our WAL to a different compression method one day.
> Thankfully we don't preserve WAL compatibility between major releases.
>
> 3) This patch provides custom compression methods recorded in the
> catalog. During recovery you don't have access to the system catalog,
> because it's not recovered yet, and can't fetch compression method
> metadata from there. The possible thing is to have a GUC which stores
> the shared module and function names for WAL compression. But that
> seems like quite a different mechanism from the one present in this
> patch.
>
> Taking into account all of the above, I think we should give up on
> custom WAL compression methods, or at least consider them unrelated to
> this patch.
I agree with these points. I also think this should be done in another
patch. It's not so hard to implement, but it would make sense once
there are a few more built-in compression methods suitable for WAL
compression. Some static array could contain function pointers for
direct calls.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
From: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-23 09:40:33 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 22.04.2018 16:21, Alexander Korotkov wrote:
> On Fri, Apr 20, 2018 at 7:45 PM, Konstantin Knizhnik
> <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
>> On 30.03.2018 19:50, Ildus Kurbangaliev wrote:
>>> On Mon, 26 Mar 2018 20:38:25 +0300
>>> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>>>> Attached a rebased version of the patch. Fixed conflicts in
>>>> pg_class.h.
>>> New rebased version due to conflicts in master. Also fixed a few
>>> errors and removed the cmdrop method since it couldn't be tested.
>> It seems to be useful (and not so difficult) to use custom
>> compression methods also for WAL compression: replace the direct
>> calls of pglz_compress in xloginsert.c.
>
>
> I'm going to object at this point, and I have the following arguments
> for that:
>
> 1) WAL compression is much more critical for durability than datatype
> compression. Imagine a compression algorithm contains a bug which
> causes the decompress method to segfault. In the case of datatype
> compression, that would cause a crash on access to some value; but the
> rest of the database would keep working, giving you a chance to
> localize and investigate the issue. In the case of WAL compression,
> recovery would cause a server crash. That seems to be a much more
> serious disaster: you wouldn't be able to get your database up and
> running, and the same would happen on the standby.
>
Well, I do not think that somebody will try to implement their own
compression algorithm...
From my point of view the main value of this patch is that it allows
replacing the pglz algorithm with a more efficient one, for example
zstd. On some data sets zstd provides a more than 10 times better
compression ratio and at the same time is faster than pglz.
I do not think that the risk of data corruption caused by WAL
compression with some alternative compression algorithm (zlib, zstd,
...) is higher than in the case of the built-in Postgres compression.
> 2) The idea of custom compression methods is that some columns may
> have a specific data distribution, which could be handled better with
> a particular compression method and particular parameters. In WAL
> compression you're dealing with the whole WAL stream containing all
> the values from the database cluster. Moreover, if custom compression
> methods are defined for columns, then in the WAL stream you have
> values already compressed in the most efficient way. However, it might
> appear that some compression method is better for WAL in the general
> case (there are benchmarks showing our pglz is not very good in
> comparison to the alternatives). But in that case I would prefer to
> just switch our WAL to a different compression method one day.
> Thankfully we don't preserve WAL compatibility between major releases.
Frankly speaking I do not believe that somebody will use custom
compression in this way: implement their own compression methods for a
specific data type.
Maybe just for json/jsonb, but even then only if the custom compression
API allows storing a compression dictionary separately (which, as far
as I understand, is not currently supported).
When I worked for SciDB (a database for scientists which has to deal
mostly with multidimensional arrays of data) our first intention was to
implement custom compression methods for particular data types and
data distributions. For example, there are very fast, simple and
efficient algorithms for encoding a sequence of monotonically
increasing integers, ....
But after several experiments we rejected this idea and switched to
using generic compression methods, mostly because we did not want the
compressor to know much about page layout, data type representation,...
In Postgres, from my point of view, we have a similar situation. Assume
that we have a column of serial type. It is a good candidate for
compression, isn't it?
But this approach deals only with individual attribute values. It
cannot take any advantage of the fact that this particular column is
monotonically increasing. That can be done only with page-level
compression, but it is a different story.
So the current approach works only for blob-like types: text, json,...
But they usually have quite a complex internal structure, and for them
universal compression algorithms tend to be more efficient than any
hand-written specific implementation. Also, algorithms like zstd are
able to efficiently recognize and compress many common data
distributions, like monotonic sequences, duplicates, repeated series,...
>
> 3) This patch provides custom compression methods recorded in the
> catalog. During recovery you don't have access to the system catalog,
> because it's not recovered yet, and can't fetch compression method
> metadata from there. The possible thing is to have a GUC which stores
> the shared module and function names for WAL compression. But that
> seems like quite a different mechanism from the one present in this
> patch.
>
I do not think that assigning a default compression method through a
GUC is such a bad idea.
> Taking into account all of the above, I think we should give up on
> custom WAL compression methods, or at least consider them unrelated to
> this patch.
>
Sorry for repeating the same thing, but from my point of view the main
advantage of this patch is that it allows replacing pglz with more
efficient compression algorithms.
I do not see much sense in specifying a custom compression method for
particular columns.
It would be more useful, from my point of view, to include in this
patch implementations of the compression API not only for pglz, but
also for zlib, zstd and maybe some other popular compression libraries
which have proved their efficiency.
Postgres already has a zlib dependency (unless explicitly excluded with
--without-zlib), so a zlib implementation can be included in the
Postgres build.
Other implementations can be left as modules which customers can build
themselves. That is certainly less convenient than using pre-existing
stuff, but much more convenient than making users write this code
themselves.
There is yet another aspect which is not covered by this patch:
streaming compression.
Streaming compression is needed if we want to compress libpq traffic.
It can be very efficient for the COPY command and for replication.
Also, libpq compression can improve the speed of queries returning
large results (for example containing JSON columns) over a slow
network.
I have proposed such a patch for libpq, which uses either the zlib or
the zstd streaming API. The Postgres built-in compression
implementation doesn't have a streaming API at all, so it cannot be
used here. Certainly, support for streaming may significantly
complicate the compression API, so I am not sure that it actually
needs to be included in this patch. But I will be pleased if Ildus can
consider this idea.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-23 15:32:28 |
Message-ID: | CAPpHfdtxo7VYd1hrrXAMkBZOk-ZPNhH=SFaJwaGedT5sZ34VrA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 23, 2018 at 12:40 PM, Konstantin Knizhnik
<k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> On 22.04.2018 16:21, Alexander Korotkov wrote:
>> On Fri, Apr 20, 2018 at 7:45 PM, Konstantin Knizhnik
>> <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>>> On 30.03.2018 19:50, Ildus Kurbangaliev wrote:
>>>> On Mon, 26 Mar 2018 20:38:25 +0300
>>>> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>>>>> Attached a rebased version of the patch. Fixed conflicts in
>>>>> pg_class.h.
>>>> New rebased version due to conflicts in master. Also fixed a few
>>>> errors and removed the cmdrop method since it couldn't be tested.
>>> It seems to be useful (and not so difficult) to use custom
>>> compression methods also for WAL compression: replace the direct
>>> calls of pglz_compress in xloginsert.c.
>
>
>> I'm going to object at this point, and I have the following
>> arguments for that:
>>
>> 1) WAL compression is much more critical for durability than datatype
>> compression. Imagine a compression algorithm contains a bug which
>> causes the decompress method to segfault. In the case of datatype
>> compression, that would cause a crash on access to some value; but
>> the rest of the database would keep working, giving you a chance to
>> localize and investigate the issue. In the case of WAL compression,
>> recovery would cause a server crash. That seems to be a much more
>> serious disaster: you wouldn't be able to get your database up and
>> running, and the same would happen on the standby.
>
> Well, I do not think that somebody will try to implement their own
> compression algorithm...
>
But that is the main goal of this patch: to let somebody implement
their own compression algorithm which best fits a particular dataset.
> From my point of view the main value of this patch is that it allows
> replacing the pglz algorithm with a more efficient one, for example
> zstd. On some data sets zstd provides a more than 10 times better
> compression ratio and at the same time is faster than pglz.
>
Not exactly. If we want to replace pglz with a more efficient
algorithm, then we should just replace pglz with that algorithm.
Pluggable compression methods are definitely not worth it for just
replacing pglz with zstd.
> I do not think that the risk of data corruption caused by WAL
> compression with some alternative compression algorithm (zlib, zstd,
> ...) is higher than in the case of the built-in Postgres compression.
>
If speaking about zlib or zstd, then yes, the risk of corruption is very
low. But again, switching to zlib or zstd doesn't justify this patch.
> 2) The idea of custom compression methods is that some columns may
> have a specific data distribution, which could be handled better with
> a particular compression method and particular parameters. In
> WAL compression you're dealing with the whole WAL stream, containing
> all the values from the database cluster. Moreover, if custom compression
> methods are defined for columns, then in the WAL stream you have values
> already compressed in the most efficient way. However, it might
> turn out that some compression method is better for WAL in the general
> case (there are benchmarks showing our pglz is not very good in
> comparison to the alternatives). But in this case I would prefer to just
> switch our WAL to a different compression method one day. Thankfully
> we don't preserve WAL compatibility between major releases.
>
>
> Frankly speaking, I do not believe that somebody will use custom
> compression in this way: implementing their own compression methods for a
> specific data type.
> Maybe just for json/jsonb, but even then only if the custom
> compression API allows storing a compression dictionary separately
> (which, as far as I understand, is not currently supported).
>
> When I worked for SciDB (a database for scientists which has to deal
> mostly with multidimensional arrays of data), our first intention was to
> implement custom compression methods for particular data types and data
> distributions. For example, there are very fast, simple and efficient
> algorithms for encoding sequences of monotonically increasing integers, ...
> But after several experiments we rejected this idea and switched to using
> generic compression methods, mostly because we did not want the compressor
> to know much about page layout, data type representation, ... In Postgres,
> from my point of view, we have a similar situation. Assume we have a
> column of serial type. So it is a good candidate for compression, isn't it?
>
No, it's not, exactly because the compressor shouldn't deal with page
layout etc. But it's absolutely OK for a datatype compressor to deal with a
particular type representation.
> But this approach deals only with particular attribute values. It cannot
> take any advantage of the fact that this particular column is
> monotonically increasing. That can be done only with page-level
> compression, but that is a different story.
>
Yes, compression of data series spread across multiple rows is a different
story.
> So the current approach works only for blob-like types: text, json, ...
> But those usually have a quite complex internal structure, and for them
> universal compression algorithms tend to be more efficient than any
> hand-written specific implementation. Also, algorithms like zstd are able
> to efficiently recognize and compress many common data distributions, like
> monotonic sequences, duplicates, repeated series, ...
>
Some blob-like datatypes might not be long enough to let generic
compression algorithms like zlib or zstd train a dictionary. For example,
MySQL successfully utilizes column-level dictionaries for JSON [1]. Also,
JSON(B) might utilize some compression which lets the user extract
particular attributes without decompressing the whole document.
> 3) This patch provides custom compression methods recorded in
> the catalog. During recovery you don't have access to the system
> catalog, because it's not recovered yet, and can't fetch compression
> method metadata from there. A possible approach is to have a GUC
> which stores the shared module and function names for WAL compression.
> But that seems like quite a different mechanism from the one present
> in this patch.
>
> I do not think that assigning the default compression method through a
> GUC is such a bad idea.
>
It's probably not so bad, but it's a different story. Unrelated to this
patch, I think.
Taking into account all of the above, I think we should give up on a custom
WAL compression method. Or, at least, consider it unrelated to this
patch.
>
> Sorry for repeating the same thing, but from my point of view the main
> advantage of this patch is that it allows replacing pglz with more
> efficient compression algorithms.
> I do not see much sense in specifying a custom compression method for
> particular columns.
>
This patch is about giving the user the ability to select a particular
compression method and its parameters for a particular column.
> It would be more useful from my point of view to include in this patch
> implementations of the compression API not only for pglz, but also for
> zlib, zstd and maybe some other popular compression libraries which have
> proved their efficiency.
>
> Postgres already has a zlib dependency (unless explicitly excluded with
> --without-zlib), so a zlib implementation can be included in the Postgres
> build. Other implementations can be left as modules which users can build
> themselves. That is certainly less convenient than using preexisting
> stuff, but much more convenient than making users write this code
> themselves.
>
> There is yet another aspect which is not covered by this patch: streaming
> compression.
> Streaming compression is needed if we want to compress libpq traffic. It
> can be very efficient for the COPY command and for replication. libpq
> compression can also improve the speed of queries returning large results
> (for example containing JSON columns) over a slow network.
> I have proposed such a patch for libpq, which uses either the zlib or the
> zstd streaming API. The Postgres built-in compression implementation has
> no streaming API at all, so it cannot be used here. Certainly, support for
> streaming may significantly complicate the compression API, so I am not
> sure that it actually needs to be included in this patch.
> But I will be pleased if Ildus can consider this idea.
>
I think streaming compression seems like a completely different story.
Client-server traffic compression is not just a server feature: it must
also be supported on the client side. And I really doubt it should be
pluggable.
In my opinion, you propose good things, like compression of WAL
with a better algorithm and compression of client-server traffic.
But I think those features are unrelated to this patch and should
be considered separately. They are not features which should be
added to this patch. Regarding this patch, the points you provided
seem more like criticism of the general idea.
I think the problem with this patch is that it lacks a good example.
It would be nice if Ildus implemented simple compression with a
column-defined dictionary (like [1] does), and showed its efficiency
on real-life examples where the same results can't be achieved with
generic compression methods (like zlib or zstd). That would be a good
answer to the criticism you provide.
*Links*

1. https://www.percona.com/doc/percona-server/LATEST/flexibility/compressed_columns.html

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-23 16:34:38 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 23.04.2018 18:32, Alexander Korotkov wrote:
> But that is the main goal of this patch: to let somebody implement their
> own compression algorithm which best fits a particular dataset.
Hmmm... Frankly speaking, I don't believe in this "somebody".
>
> From my point of view the main value of this patch is that it
> allows replacing the pglz algorithm with a more efficient one, for
> example zstd.
> On some data sets zstd provides a more than 10 times better
> compression ratio and at the same time is faster than pglz.
>
>
> Not exactly. If we want to replace pglz with a more efficient algorithm,
> then we should just replace pglz with that algorithm. Pluggable
> compression methods are definitely not worth it just for replacing pglz
> with zstd.
As far as I understand, it is not possible for many reasons (portability,
patents, ...) to replace pglz with zstd.
I think that even replacing pglz with zlib (which is much worse than
zstd) would not be accepted by the community.
So from my point of view the main advantage of custom compression methods
is to replace the built-in pglz compression with a more advanced one.
> Some blob-like datatypes might not be long enough to let generic
> compression algorithms like zlib or zstd train a dictionary. For example,
> MySQL successfully utilizes column-level dictionaries for JSON [1]. Also,
> JSON(B) might utilize some compression which lets the user extract
> particular attributes without decompressing the whole document.
Well, I am not an expert in compression.
But I will be very surprised if somebody shows me a real example
with a large enough compressed data buffer (>2kb) where some specialized
algorithm provides a significantly better compression ratio than an
advanced universal compression algorithm.
Also, maybe I missed something, but the current compression API doesn't
support partial extraction (extracting some particular attribute or range).
If we really need it, then it should be expressed in the custom compressor
API. But I am not sure how frequently it will be needed.
Large values are split into 2kb TOAST chunks. With compression that can
be about 4-8kb of raw data. IMHO storing larger JSON objects is a database
design flaw.
And taking into account that for JSONB we also need to extract the header
(so at least two chunks), the advantages of partial JSONB decompression
become even more obscure.
>>
> I do not think that assigning the default compression method through
> a GUC is such a bad idea.
>
>
> It's probably not so bad, but it's a different story. Unrelated to
> this patch, I think.
Maybe. But in any case, there are several directions where compression
can be used:
- custom compression algorithms
- libpq compression
- page level compression
...
and they should somehow finally be "married" with each other.
>
> I think streaming compression seems like a completely different story.
> Client-server traffic compression is not just a server feature: it must
> also be supported on the client side. And I really doubt it should be
> pluggable.
>
> In my opinion, you propose good things, like compression of WAL
> with a better algorithm and compression of client-server traffic.
> But I think those features are unrelated to this patch and should
> be considered separately. They are not features which should be
> added to this patch. Regarding this patch, the points you provided
> seem more like criticism of the general idea.
>
> I think the problem with this patch is that it lacks a good example.
> It would be nice if Ildus implemented simple compression with a
> column-defined dictionary (like [1] does), and showed its efficiency
> on real-life examples where the same results can't be achieved with
> generic compression methods (like zlib or zstd). That would be a good
> answer to the criticism you provide.
>
> *Links*
>
> 1.
> https://www.percona.com/doc/percona-server/LATEST/flexibility/compressed_columns.html
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
Sorry, I am really looking at this patch from a different angle,
and this is why I have some doubts about the general idea.
Postgres allows defining custom types, access methods, ...
But do you know any production system using special data types or
custom indexes which are not included in the standard Postgres distribution
or popular extensions (like postgis)?
IMHO end-users do not have the skills and time to create their own
compression algorithms. And without knowledge of the specifics of a
particular data set,
it is very hard to implement something more efficient than a universal
compression library.
But if you think that this is not the right place and time to discuss it, I
do not insist.
But in any case, I think that it would be useful to provide some more
examples of custom compression API usage.
From my point of view the most useful would be integration with zstd.
But if it is possible to find some example of a data-specific compression
algorithm which shows better results than universal compression,
it would be even more impressive.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-23 18:47:46 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 23 Apr 2018 19:34:38 +0300
Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> >
> Sorry, I am really looking at this patch from a different angle,
> and this is why I have some doubts about the general idea.
> Postgres allows defining custom types, access methods, ...
> But do you know any production system using special data types
> or custom indexes which are not included in the standard Postgres
> distribution or popular extensions (like postgis)?
>
> IMHO end-users do not have the skills and time to create their own
> compression algorithms. And without knowledge of the specifics of a
> particular data set,
> it is very hard to implement something more efficient than a universal
> compression library.
> But if you think that this is not the right place and time to discuss it,
> I do not insist.
>
> But in any case, I think that it would be useful to provide some more
> examples of custom compression API usage.
> From my point of view the most useful would be integration with zstd.
> But if it is possible to find some example of a data-specific
> compression algorithm which shows better results than universal
> compression, it would be even more impressive.
>
>
Ok, let me clear up the purpose of this patch. I understand that you
want to compress everything with it, but for now the idea is just to bring
in the basic functionality to compress TOASTed values with external
compression algorithms. It's unlikely that compression algorithms like
zstd, snappy and others will be in the postgres core, but with this patch
it's really easy to make an extension and start compressing values
with it right away. And the end-user does not need to be an expert in
compression algorithms to make such an extension. One of these algorithms
could end up in core if its license allows it.
I'm not trying to cover all the places in postgres which would benefit
from compression; this patch is only the first step. It's quite big
already, and every new feature that increases its size will decrease the
chances of it being reviewed and committed.
The API is very simple now and contains what every compression
method can do - take a block of data and return a compressed form
of the data. And it can be extended with streaming and other features in
the future.
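To make that contract concrete, here is a minimal sketch; the struct and
member names below are illustrative assumptions on my side, not
necessarily the identifiers used in the patch:

#include "postgres.h"

/*
 * A compression method reduced to its essentials: two functions that map
 * one block of bytes to another.  Returning NULL from the compress
 * callback would mean "store the value uncompressed".
 */
typedef struct CompressionRoutine
{
    struct varlena *(*cmcompress) (const struct varlena *value);
    struct varlena *(*cmdecompress) (const struct varlena *value);
} CompressionRoutine;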
Maybe the reason for your confusion is that there is no GUC that changes
pglz to some custom compression so that all new attributes would use it. I
will think about adding one. Also, there was a discussion about
specifying the compression method for a type, and it was decided that it's
better to do that later in a separate patch.
An example of specialized compression could be the time series
compression described in [1]. [2] contains an example of an extension
that adds lz4 compression using this patch; see the sketch after the
links below.
[1] http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
[2] https://github.com/zilder/pg_lz4
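As a rough illustration of how small such an extension can be, the core of
its compress callback might look like this. The function name and the
convention of returning NULL for incompressible data are assumptions here;
[2] is the authoritative code:

#include "postgres.h"
#include <lz4.h>

/* Sketch of a compress callback built on the liblz4 simple API. */
static struct varlena *
lz4_cmcompress(const struct varlena *value)
{
    int32           rawsize = VARSIZE_ANY_EXHDR(value);
    int32           bound = LZ4_compressBound(rawsize);
    struct varlena *result = (struct varlena *) palloc(VARHDRSZ + bound);
    int32           len;

    len = LZ4_compress_default(VARDATA_ANY(value), VARDATA(result),
                               rawsize, bound);
    if (len <= 0 || len >= rawsize)
    {
        /* compression failed or did not save space */
        pfree(result);
        return NULL;
    }
    SET_VARSIZE(result, len + VARHDRSZ);
    return result;
}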
--
----
Regards,
Ildus Kurbangaliev
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-04-24 11:05:20 |
Message-ID: | CAPpHfduorgpFjQprDPXcUz3MYiau66z+fdrrepxhF+ZfDcy-aA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 23, 2018 at 7:34 PM, Konstantin Knizhnik <
k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> IMHO end-users do not have the skills and time to create their own
> compression algorithms. And without knowledge of the specifics of a
> particular data set, it is very hard to implement something more
> efficient than a universal compression library.
> But if you think that this is not the right place and time to discuss it,
> I do not insist.
>
For sure, end-users wouldn't implement their own compression algorithms,
in the same way as end-users wouldn't implement custom datatypes,
operator classes, procedural language handlers etc. But those are
useful extension mechanisms which have passed the test of time. And
extension developers use them.
> But in any case, I think that it would be useful to provide some more
> examples of custom compression API usage.
> From my point of view the most useful would be integration with zstd.
> But if it is possible to find some example of a data-specific compression
> algorithm which shows better results than universal compression,
> it would be even more impressive.
>
Yes, this patch definitely lacks a good usage example. That may
lead to some misunderstanding of its purpose. Good use-cases
should be shown before we can consider committing this. I think
Ildus should try to implement at least a custom dictionary compression
method where the dictionary is specified by the user in parameters.
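The compression core of such a method is small. Here is a sketch using
standard zlib calls; the integration points with the patch, such as where
the dictionary bytes come from (presumably the user-supplied compression
options), are assumptions:

#include <string.h>
#include <zlib.h>

/*
 * Compress src into dst, priming zlib with a column-specific dictionary.
 * Returns the compressed size, or -1 on failure or insufficient space.
 */
static int
compress_with_dict(const char *src, size_t srclen,
                   char *dst, size_t dstlen,
                   const char *dict, size_t dictlen)
{
    z_stream    zs;
    int         rc;

    memset(&zs, 0, sizeof(zs));
    if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;
    if (deflateSetDictionary(&zs, (const Bytef *) dict, dictlen) != Z_OK)
    {
        deflateEnd(&zs);
        return -1;
    }
    zs.next_in = (Bytef *) src;
    zs.avail_in = srclen;
    zs.next_out = (Bytef *) dst;
    zs.avail_out = dstlen;
    rc = deflate(&zs, Z_FINISH);
    deflateEnd(&zs);
    return (rc == Z_STREAM_END) ? (int) zs.total_out : -1;
}

The win comes from short documents sharing structure: the dictionary lets
zlib find matches even in values far smaller than what it would normally
need to build up its window.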
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-06-18 14:30:45 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, 24 Apr 2018 14:05:20 +0300
Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> Yes, this patch definitely lacks a good usage example. That may
> lead to some misunderstanding of its purpose. Good use-cases
> should be shown before we can consider committing this. I think
> Ildus should try to implement at least a custom dictionary compression
> method where the dictionary is specified by the user in parameters.
>
Hi,
attached is v16 of the patch. I have split the patch into 8 parts, so now
it should be easier to review. The main improvement is a zlib
compression method with dictionary support, as you mentioned. My
synthetic tests showed that zlib gives better compression but is usually
slower than pglz.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0001-Make-syntax-changes-for-custom-compression-metho-v16.patch | text/x-patch | 9.1 KB |
0002-Add-compression-catalog-tables-and-the-basic-inf-v16.patch | text/x-patch | 114.0 KB |
0003-Add-rewrite-rules-and-tupdesc-flags-v16.patch | text/x-patch | 24.4 KB |
0004-Add-pglz-compression-method-v16.patch | text/x-patch | 7.5 KB |
0005-Add-zlib-compression-method-v16.patch | text/x-patch | 9.9 KB |
0006-Add-psql-pg_dump-and-pg_upgrade-support-v16.patch | text/x-patch | 37.5 KB |
0007-Add-tests-for-compression-methods-v16.patch | text/x-patch | 133.5 KB |
0008-Add-documentation-for-custom-compression-methods-v16.patch | text/x-patch | 16.1 KB |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-06-19 14:57:37 |
Message-ID: | CA+TgmoazcTRNWacQ3Lwnrp_u7WyEC3r7o6zZ+rpmfaZez+kOug@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 23, 2018 at 12:34 PM, Konstantin Knizhnik
<k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> May be. But in any cases, there are several direction where compression can
> be used:
> - custom compression algorithms
> - libpq compression
> - page level compression
> ...
>
> and them should be somehow finally "married" with each other.
I agree that we should try to avoid multiplying the number of
compression-related APIs. Ideally there should be one API for
registering compression algorithms, and then there can be different
methods of selecting that compression algorithm depending on the
purpose for which it will be used. For instance, you could select a
column compression format using some variant of ALTER TABLE ... ALTER
COLUMN, but you would obviously use some other method to select the
WAL compression format. However, it's a little unclear to me how we
would actually make the idea of a single API work. For column
compression, we need everything to be accessible through the catalogs.
For something like WAL compression, we need it to be completely
independent of the catalogs. Those things are opposites, so a single
API can't have both properties. Maybe there can be some pieces
shared, but as much as I'd like it to be otherwise, it doesn't seem
possible to share it completely.
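A hypothetical sketch of that split (none of these names are from the
patch or from core; they only illustrate shared registration with
per-purpose lookup):

#include "postgres.h"

typedef struct CompressionAlgorithm
{
    const char *name;
    struct varlena *(*compress) (const struct varlena *value, void *options);
    struct varlena *(*decompress) (const struct varlena *value, void *options);
} CompressionAlgorithm;

/* Column compression: algorithm resolved through the catalog, by OID. */
extern const CompressionAlgorithm *lookup_compression_by_oid(Oid cmoid);

/* WAL compression: algorithm resolved from a GUC naming a shared library,
 * with no catalog access, so it would work during recovery. */
extern const CompressionAlgorithm *lookup_compression_by_guc(void);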
I also agree with Ildus and Alexander that we cannot and should not
try to solve every problem in one patch. Rather, we should just think
ahead, so that we make as much of what goes into this patch reusable
in the future as we can.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-07-02 12:56:24 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 18 Jun 2018 17:30:45 +0300
Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> On Tue, 24 Apr 2018 14:05:20 +0300
> Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> >
> > > Yes, this patch definitely lacks a good usage example. That may
> > > lead to some misunderstanding of its purpose. Good use-cases
> > > should be shown before we can consider committing this. I think
> > > Ildus should try to implement at least a custom dictionary compression
> > > method where the dictionary is specified by the user in parameters.
> >
>
> Hi,
>
> attached is v16 of the patch. I have split the patch into 8 parts, so now
> it should be easier to review. The main improvement is a zlib
> compression method with dictionary support, as you mentioned. My
> synthetic tests showed that zlib gives better compression but is usually
> slower than pglz.
>
Hi,
I have noticed that my patch is failing to apply on cputube. Attached is a
rebased version of the patch. Nothing has really changed; I just added
and fixed some tests for zlib and improved the documentation.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0001-Make-syntax-changes-for-custom-compression-metho-v17.patch | text/x-patch | 9.1 KB |
0002-Add-compression-catalog-tables-and-the-basic-inf-v17.patch | text/x-patch | 114.0 KB |
0003-Add-rewrite-rules-and-tupdesc-flags-v17.patch | text/x-patch | 24.4 KB |
0004-Add-pglz-compression-method-v17.patch | text/x-patch | 7.5 KB |
0005-Add-zlib-compression-method-v17.patch | text/x-patch | 9.9 KB |
0006-Add-psql-pg_dump-and-pg_upgrade-support-v17.patch | text/x-patch | 37.2 KB |
0007-Add-tests-for-compression-methods-v17.patch | text/x-patch | 150.6 KB |
0008-Add-documentation-for-custom-compression-methods-v17.patch | text/x-patch | 17.2 KB |
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | "i(dot)kurbangaliev" <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-07-23 13:16:19 |
Message-ID: | CAPpHfduytTe0kxfDMZLWY=DXm0GBQKy-Km_gFcgpZ2UKAdHR5Q@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
On Mon, Jul 2, 2018 at 3:56 PM Ildus Kurbangaliev
<i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> On Mon, 18 Jun 2018 17:30:45 +0300
> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
> > On Tue, 24 Apr 2018 14:05:20 +0300
> > Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> >
> > >
> > > Yes, this patch definitely lacks a good usage example. That may
> > > lead to some misunderstanding of its purpose. Good use-cases
> > > should be shown before we can consider committing this. I think
> > > Ildus should try to implement at least a custom dictionary compression
> > > method where the dictionary is specified by the user in parameters.
> > >
> >
> > Hi,
> >
> > attached is v16 of the patch. I have split the patch into 8 parts, so
> > now it should be easier to review. The main improvement is a zlib
> > compression method with dictionary support, as you mentioned. My
> > synthetic tests showed that zlib gives better compression but is usually
> > slower than pglz.
> >
>
> I have noticed that my patch is failing to apply on cputube. Attached is
> a rebased version of the patch. Nothing has really changed; I just added
> and fixed some tests for zlib and improved the documentation.
I'm going to review this patch. Could you please rebase it? It
doesn't apply for me due to changes made in src/bin/psql/describe.c.
patching file src/bin/psql/describe.c
Hunk #1 FAILED at 1755.
Hunk #2 FAILED at 1887.
Hunk #3 FAILED at 1989.
Hunk #4 FAILED at 2019.
Hunk #5 FAILED at 2030.
5 out of 5 hunks FAILED -- saving rejects to file src/bin/psql/describe.c.rej
Also, please note that PostgreSQL 11 already passed feature freeze some
time ago. So, please adjust your patch to expect PostgreSQL 12 in
lines like this:
+ if (pset.sversion >= 110000)
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-09-06 15:27:13 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 23 Jul 2018 16:16:19 +0300
Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> I'm going to review this patch. Could you please rebase it? It
> doesn't apply for me due to changes made in src/bin/psql/describe.c.
>
> patching file src/bin/psql/describe.c
> Hunk #1 FAILED at 1755.
> Hunk #2 FAILED at 1887.
> Hunk #3 FAILED at 1989.
> Hunk #4 FAILED at 2019.
> Hunk #5 FAILED at 2030.
> 5 out of 5 hunks FAILED -- saving rejects to file
> src/bin/psql/describe.c.rej
>
> Also, please note that PostgreSQL 11 already passed feature freeze some
> time ago. So, please adjust your patch to expect PostgreSQL 12 in
> lines like this:
>
> + if (pset.sversion >= 110000)
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
Hi, attached is the latest set of patches. Rebased and fixed pg_upgrade
errors related to zlib support.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0001-Make-syntax-changes-for-custom-compression-metho-v19.patch | text/x-patch | 9.2 KB |
0002-Add-compression-catalog-tables-and-the-basic-inf-v19.patch | text/x-patch | 114.5 KB |
0003-Add-rewrite-rules-and-tupdesc-flags-v19.patch | text/x-patch | 24.4 KB |
0004-Add-pglz-compression-method-v19.patch | text/x-patch | 7.6 KB |
0005-Add-zlib-compression-method-v19.patch | text/x-patch | 10.0 KB |
0006-Add-psql-pg_dump-and-pg_upgrade-support-v19.patch | text/x-patch | 34.8 KB |
0007-Add-tests-for-compression-methods-v19.patch | text/x-patch | 150.6 KB |
0008-Add-documentation-for-custom-compression-methods-v19.patch | text/x-patch | 17.3 KB |
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-11-01 11:54:37 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, 6 Sep 2018 18:27:13 +0300
Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
> Hi, attached is the latest set of patches. Rebased and fixed pg_upgrade
> errors related to zlib support.
>
Hi, just updated patches to current master. Nothing new.
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0001-Make-syntax-changes-for-custom-compression-metho-v20.patch | text/x-patch | 9.2 KB |
0002-Add-compression-catalog-tables-and-the-basic-inf-v20.patch | text/x-patch | 114.3 KB |
0003-Add-rewrite-rules-and-tupdesc-flags-v20.patch | text/x-patch | 24.4 KB |
0004-Add-pglz-compression-method-v20.patch | text/x-patch | 7.6 KB |
0005-Add-zlib-compression-method-v20.patch | text/x-patch | 10.0 KB |
0006-Add-psql-pg_dump-and-pg_upgrade-support-v20.patch | text/x-patch | 34.4 KB |
0007-Add-tests-for-compression-methods-v20.patch | text/x-patch | 150.6 KB |
0008-Add-documentation-for-custom-compression-methods-v20.patch | text/x-patch | 17.3 KB |
From: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
---|---|
To: | i(dot)kurbangaliev(at)postgrespro(dot)ru, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-11-30 14:08:39 |
Message-ID: | CA+q6zcUTF97=Ao1C2b_0_6znzZX-eQtSB-KvTwZBV_X+e3StJw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On Thu, Sep 6, 2018 at 5:27 PM Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
>
> Hi, attached is the latest set of patches. Rebased and fixed pg_upgrade
> errors related to zlib support.
Thank you for working on this patch; I believe the ideas mentioned in this
thread are quite important for improving Postgres. Unfortunately, the patch
has some conflicts now; could you post a rebased version one more time?
> On Mon, 23 Jul 2018 16:16:19 +0300
> Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> > I'm going to review this patch. Could you please rebase it? It
> > doesn't apply for me due to changes made in src/bin/psql/describe.c.
Is there any review underway? Could you share the results?
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
---|---|
To: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2018-12-03 12:43:32 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, 30 Nov 2018 15:08:39 +0100
Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > On Thu, Sep 6, 2018 at 5:27 PM Ildus Kurbangaliev
> > <i(dot)kurbangaliev(at)postgrespro(dot)ru> wrote:
> >
> > Hi, attached is the latest set of patches. Rebased and fixed pg_upgrade
> > errors related to zlib support.
>
> Thank you for working on this patch; I believe the ideas mentioned in
> this thread are quite important for improving Postgres.
> Unfortunately, the patch has some conflicts now; could you post a rebased
> version one more time?
Hi, here is a rebased version. I hope it will get some review :)
--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0001-Make-syntax-changes-for-custom-compression-metho-v20.patch | text/x-patch | 9.2 KB |
0002-Add-compression-catalog-tables-and-the-basic-inf-v20.patch | text/x-patch | 113.9 KB |
0003-Add-rewrite-rules-and-tupdesc-flags-v20.patch | text/x-patch | 24.4 KB |
0004-Add-pglz-compression-method-v20.patch | text/x-patch | 7.6 KB |
0005-Add-zlib-compression-method-v20.patch | text/x-patch | 10.0 KB |
0006-Add-psql-pg_dump-and-pg_upgrade-support-v20.patch | text/x-patch | 34.4 KB |
0007-Add-tests-for-compression-methods-v20.patch | text/x-patch | 147.3 KB |
0008-Add-documentation-for-custom-compression-methods-v20.patch | text/x-patch | 17.3 KB |
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-02-04 05:26:28 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 03, 2018 at 03:43:32PM +0300, Ildus Kurbangaliev wrote:
> Hi, here is a rebased version. I hope it will get some review :)
This patch set is failing to apply, so moved to next CF, waiting for
author.
--
Michael
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-02-28 15:44:15 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
here is another set of patches,
only rebased to current master.
Also I will change the status on the commitfest to 'Needs review'.
--
Regards,
Ildus Kurbangaliev
Attachment | Content-Type | Size |
---|---|---|
0001-Make-syntax-changes-for-custom-compression-metho-v21.patch | text/x-patch | 9.2 KB |
0002-Add-compression-catalog-tables-and-the-basic-inf-v21.patch | text/x-patch | 113.9 KB |
0003-Add-rewrite-rules-and-tupdesc-flags-v21.patch | text/x-patch | 24.4 KB |
0004-Add-pglz-compression-method-v21.patch | text/x-patch | 7.6 KB |
0005-Add-zlib-compression-method-v21.patch | text/x-patch | 10.0 KB |
0006-Add-psql-pg_dump-and-pg_upgrade-support-v21.patch | text/x-patch | 34.4 KB |
0007-Add-tests-for-compression-methods-v21.patch | text/x-patch | 148.2 KB |
0008-Add-documentation-for-custom-compression-methods-v21.patch | text/x-patch | 17.3 KB |
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2019-03-07 07:42:41 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
> here is another set of patches,
> only rebased to current master.
>
> Also I will change the status on the commitfest to 'Needs review'.
This patch has seen periodic rebases but no code review that I can see
since January 2018.
As Andres noted in [1], I think that we need to decide if this is a
feature that we want rather than just continuing to push it from CF to CF.
--
-David
david(at)pgmasters(dot)net
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2019-03-07 07:50:36 |
Message-ID: | CAPpHfdsQtby_G0HxPMrDA9G=swhC4o0sbm9hZ35A_uP8uw2TEQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 7, 2019 at 10:43 AM David Steele <david(at)pgmasters(dot)net> wrote:
> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
>
> > here is another set of patches,
> > only rebased to current master.
> >
> > Also I will change the status on the commitfest to 'Needs review'.
>
> This patch has seen periodic rebases but no code review that I can see
> since January 2018.
>
> As Andres noted in [1], I think that we need to decide if this is a
> feature that we want rather than just continuing to push it from CF to CF.
>
Yes. I took a look at the code of this patch. I think it's in pretty good
shape. But a high-level review/discussion is required.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: Re: [HACKERS] Custom compression methods |
Date: | 2019-03-15 10:07:14 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/7/19 11:50 AM, Alexander Korotkov wrote:
> On Thu, Mar 7, 2019 at 10:43 AM David Steele <david(at)pgmasters(dot)net
> <mailto:david(at)pgmasters(dot)net>> wrote:
>
> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
>
> > here is another set of patches,
> > only rebased to current master.
> >
> > Also I will change the status on the commitfest to 'Needs review'.
>
> This patch has seen periodic rebases but no code review that I can see
> since January 2018.
>
> As Andres noted in [1], I think that we need to decide if this is a
> feature that we want rather than just continuing to push it from CF
> to CF.
>
>
> Yes. I took a look at the code of this patch. I think it's in pretty good
> shape. But a high-level review/discussion is required.
OK, but I think this patch can only be pushed one more time, maximum,
before it should be rejected.
Regards,
--
-David
david(at)pgmasters(dot)net
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-03-15 11:52:03 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, 15 Mar 2019 14:07:14 +0400
David Steele <david(at)pgmasters(dot)net> wrote:
> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
> > On Thu, Mar 7, 2019 at 10:43 AM David Steele <david(at)pgmasters(dot)net
> > <mailto:david(at)pgmasters(dot)net>> wrote:
> >
> > On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
> >
> > > here is another set of patches,
> > > only rebased to current master.
> > >
> > > Also I will change the status on the commitfest to 'Needs review'.
> >
> > This patch has seen periodic rebases but no code review that I
> > can see since January 2018.
> >
> > As Andres noted in [1], I think that we need to decide if this
> > is a feature that we want rather than just continuing to push it
> > from CF to CF.
> >
> >
> > Yes. I took a look at the code of this patch. I think it's in pretty
> > good shape. But a high-level review/discussion is required.
>
> OK, but I think this patch can only be pushed one more time, maximum,
> before it should be rejected.
>
> Regards,
Hi,
in my opinion this patch is usually skipped not because it is not
needed, but because of its size. It is not hard to maintain it until
committers have time for it, or until I get an actual response that
nobody is going to commit it.
Attached latest set of patches.
--
Best regards,
Ildus Kurbangaliev
Attachment | Content-Type | Size |
---|---|---|
0001-Make-syntax-changes-for-custom-compression-metho-v22.patch | text/x-patch | 8.6 KB |
0002-Add-compression-catalog-tables-and-the-basic-inf-v22.patch | text/x-patch | 113.7 KB |
0003-Add-rewrite-rules-and-tupdesc-flags-v22.patch | text/x-patch | 24.3 KB |
0004-Add-pglz-compression-method-v22.patch | text/x-patch | 7.6 KB |
0005-Add-zlib-compression-method-v22.patch | text/x-patch | 10.0 KB |
0006-Add-psql-pg_dump-and-pg_upgrade-support-v22.patch | text/x-patch | 34.4 KB |
0007-Add-tests-for-compression-methods-v22.patch | text/x-patch | 153.4 KB |
0008-Add-documentation-for-custom-compression-methods-v22.patch | text/x-patch | 17.3 KB |
From: | Chris Travers <chris(dot)travers(at)adjust(dot)com> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: Re: [HACKERS] Custom compression methods |
Date: | 2019-03-15 12:57:40 |
Message-ID: | CAN-RpxAC2K+kChSfdS7u-rAc5Zm2UiMQ3CM1h3tRzGwFUr=7Gw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Mar 15, 2019 at 6:07 PM David Steele <david(at)pgmasters(dot)net> wrote:
> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
> > On Thu, Mar 7, 2019 at 10:43 AM David Steele <david(at)pgmasters(dot)net
> > <mailto:david(at)pgmasters(dot)net>> wrote:
> >
> > On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
> >
> > > here is another set of patches,
> > > only rebased to current master.
> > >
> > > Also I will change the status on the commitfest to 'Needs review'.
> >
> > This patch has seen periodic rebases but no code review that I can
> > see
> > since January 2018.
> >
> > As Andres noted in [1], I think that we need to decide if this is a
> > feature that we want rather than just continuing to push it from CF
> > to CF.
> >
> >
> > Yes. I took a look at the code of this patch. I think it's in pretty
> > good shape. But a high-level review/discussion is required.
>
> OK, but I think this patch can only be pushed one more time, maximum,
> before it should be rejected.
>
As a note, we at Adjust believe this would be very helpful for some of
our use cases and some other general use cases. I think that as a feature,
custom compression methods are a good thing, but we are not the only ones
with interests here, and we would be interested in pushing this forward if
possible, or in finding ways to contribute to better approaches in this
particular field.
>
> Regards,
> --
> -David
> david(at)pgmasters(dot)net
>
>
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
From: | Ildar Musin <ildar(at)adjust(dot)com> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-03-18 16:11:01 |
Message-ID: | CAONYFtNfWbFEUM1NHGeb0yY6LTsAjTB-H-famR6k4xKf+fCfRQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Ildus,
On Fri, Mar 15, 2019 at 12:52 PM Ildus Kurbangaliev <
i(dot)kurbangaliev(at)gmail(dot)com> wrote:
>
> Hi,
> in my opinion this patch is usually skipped not because it is not
> needed, but because of its size. It is not hard to maintain it until
> commiters will have time for it or I will get actual response that
> nobody is going to commit it.
>
> Attached latest set of patches.
>
>
As I understand it, the only thing changed since my last review is the
additional zlib compression method.
The code looks good. I have one suggestion though. Currently you only
predefine two compression levels: `best_speed` and `best_compression`. But
zlib itself allows a finer gradation between those two: it is possible to
set the level to values from 0 to 9 (where zero means no compression at
all, which I guess isn't useful in our case). So I think we should let the
user choose either the textual representation (as you already did) or a
numeric one. Another thing is that one can specify, for instance, the
`best_speed` level, but not `BEST_SPEED`, which can be a bit frustrating
for the user. A sketch of the option parsing I have in mind follows.
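This is only a sketch, assuming the option value arrives as a string; the
function name and the error handling are illustrative, not taken from the
patch:

#include "postgres.h"
#include <zlib.h>

/*
 * Map a user-supplied level option to a zlib level: accept the symbolic
 * names case-insensitively, or a single digit 1..9.
 */
static int
parse_zlib_level(const char *value)
{
    if (pg_strcasecmp(value, "best_speed") == 0)
        return Z_BEST_SPEED;        /* 1 */
    if (pg_strcasecmp(value, "best_compression") == 0)
        return Z_BEST_COMPRESSION;  /* 9 */
    if (strlen(value) == 1 && value[0] >= '1' && value[0] <= '9')
        return value[0] - '0';
    elog(ERROR, "unrecognized zlib compression level \"%s\"", value);
    return -1;                      /* keep compiler quiet */
}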
Regards,
Ildar Musin
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-03-18 22:08:52 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/15/19 12:52 PM, Ildus Kurbangaliev wrote:
> On Fri, 15 Mar 2019 14:07:14 +0400
> David Steele <david(at)pgmasters(dot)net> wrote:
>
>> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
>>> On Thu, Mar 7, 2019 at 10:43 AM David Steele <david(at)pgmasters(dot)net
>>> <mailto:david(at)pgmasters(dot)net>> wrote:
>>>
>>> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
>>>
>>> > here is another set of patches,
>>> > only rebased to current master.
>>> >
>>> > Also I will change the status on the commitfest to 'Needs review'.
>>>
>>> This patch has seen periodic rebases but no code review that I
>>> can see since January 2018.
>>>
>>> As Andres noted in [1], I think that we need to decide if this
>>> is a feature that we want rather than just continuing to push it
>>> from CF to CF.
>>>
>>>
>>> Yes. I took a look at the code of this patch. I think it's in pretty
>>> good shape. But a high-level review/discussion is required.
>>
>> OK, but I think this patch can only be pushed one more time, maximum,
>> before it should be rejected.
>>
>> Regards,
>
> Hi,
> in my opinion this patch is usually skipped not because it is not
> needed, but because of its size. It is not hard to maintain it until
> committers have time for it, or until I get an actual response that
> nobody is going to commit it.
>
That may be one of the reasons, yes. But there are other reasons, which
I think may be playing a bigger role.
There's one practical issue with how the patch is structured - the docs
and tests are in separate patches towards the end of the patch series,
which makes it impossible to commit the preceding parts. This needs to
change. Otherwise the patch size kills the patch as a whole.
But there's a more important cost/benefit issue, I think. When I look at
patches as a committer, I naturally have to weigh how much time I spend
on getting it in (and then dealing with fallout from bugs etc.) vs. what
I get in return (measured in benefits for the community and users). This
patch is pretty large and complex, so the "costs" are quite high, while
the benefit from the patch itself is the ability to pick between pg_lz
and zlib. That is not great, and so people tend to pick other patches.
Now, I understand there's a lot of potential benefits further down the
line, like column-level compression (which I think is the main goal
here). But that's not included in the patch, so the gains are somewhat
far in the future.
But hey, I think there are committers working for postgrespro, who might
have the motivation to get this over the line. Of course, assuming that
there are no serious objections to having this functionality or how it's
implemented ... But I don't think that was the case.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Chris Travers <chris(dot)travers(at)adjust(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-03-19 09:59:38 |
Message-ID: | CAN-RpxDFZUduYmOociaYHsnh3vw2E8wjsQ+Ht1xRvemp3H=6WA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 18, 2019 at 11:09 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
>
>
> On 3/15/19 12:52 PM, Ildus Kurbangaliev wrote:
> > On Fri, 15 Mar 2019 14:07:14 +0400
> > David Steele <david(at)pgmasters(dot)net> wrote:
> >
> >> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
> >>> On Thu, Mar 7, 2019 at 10:43 AM David Steele <david(at)pgmasters(dot)net
> >>> <mailto:david(at)pgmasters(dot)net>> wrote:
> >>>
> >>> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
> >>>
> >>> > here is another set of patches,
> >>> > only rebased to current master.
> >>> >
> >>> > Also I will change the status on the commitfest to 'Needs review'.
> >>>
> >>> This patch has seen periodic rebases but no code review that I
> >>> can see since January 2018.
> >>>
> >>> As Andres noted in [1], I think that we need to decide if this
> >>> is a feature that we want rather than just continuing to push it
> >>> from CF to CF.
> >>>
> >>>
> >>> Yes. I took a look at the code of this patch. I think it's in pretty
> >>> good shape. But a high-level review/discussion is required.
> >>
> >> OK, but I think this patch can only be pushed one more time, maximum,
> >> before it should be rejected.
> >>
> >> Regards,
> >
> > Hi,
> > in my opinion this patch is usually skipped not because it is not
> > needed, but because of its size. It is not hard to maintain it until
> > committers have time for it, or until I get an actual response that
> > nobody is going to commit it.
> >
>
> That may be one of the reasons, yes. But there are other reasons, which
> I think may be playing a bigger role.
>
> There's one practical issue with how the patch is structured - the docs
> and tests are in separate patches towards the end of the patch series,
> which makes it impossible to commit the preceding parts. This needs to
> change. Otherwise the patch size kills the patch as a whole.
>
> But there's a more important cost/benefit issue, I think. When I look at
> patches as a committer, I naturally have to weigh how much time I spend
> on getting it in (and then dealing with fallout from bugs etc.) vs. what
> I get in return (measured in benefits for the community and users). This
> patch is pretty large and complex, so the "costs" are quite high, while
> the benefit from the patch itself is the ability to pick between pg_lz
> and zlib. That is not great, and so people tend to pick other patches.
>
> Now, I understand there's a lot of potential benefits further down the
> line, like column-level compression (which I think is the main goal
> here). But that's not included in the patch, so the gains are somewhat
> far in the future.
>
Not discussing whether any particular committer should pick this up but I
want to discuss an important use case we have at Adjust for this sort of
patch.
The PostgreSQL compression strategy is something we find inadequate for at
least one of our large deployments (a large debug log spanning 10PB+). Our
current solution is to set storage so that it does not compress and then
run on ZFS to get compression speedups on spinning disks.
But running PostgreSQL on ZFS has some annoying costs because we have
copy-on-write on copy-on-write, and when you add file fragmentation... I
would really like to be able to get away from having to do ZFS as an
underlying filesystem. While we have good write throughput, read
throughput is not as good as I would like.
An approach that would give us better row-level compression would allow us
to ditch the COW-filesystem-under-PostgreSQL approach.
So I think the benefits are actually quite high, particularly for those
dealing with volume/variety problems where things like JSONB might be a
go-to solution. Similarly, I could totally see systems which handle
large amounts of specialized text using extensions for dealing with these.
>
> But hey, I think there are committers working for postgrespro, who might
> have the motivation to get this over the line. Of course, assuming that
> there are no serious objections to having this functionality or how it's
> implemented ... But I don't think that was the case.
>
While I am not currently able to speak to questions of how it is
implemented, I can say with very little doubt that we would almost
certainly use this functionality if it were there, and I can see plenty of
other cases where this would be a very appropriate direction for some other
projects as well.
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Chris Travers <chris(dot)travers(at)adjust(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-03-19 11:19:53 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/19/19 10:59 AM, Chris Travers wrote:
>
>
> On Mon, Mar 18, 2019 at 11:09 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>
>
>
> On 3/15/19 12:52 PM, Ildus Kurbangaliev wrote:
> > On Fri, 15 Mar 2019 14:07:14 +0400
> > David Steele <david(at)pgmasters(dot)net <mailto:david(at)pgmasters(dot)net>> wrote:
> >
> >> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
> >>> On Thu, Mar 7, 2019 at 10:43 AM David Steele
> <david(at)pgmasters(dot)net <mailto:david(at)pgmasters(dot)net>
> >>> <mailto:david(at)pgmasters(dot)net <mailto:david(at)pgmasters(dot)net>>> wrote:
> >>>
> >>> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
> >>>
> >>> > there are another set of patches.
> >>> > Only rebased to current master.
> >>> >
> >>> > Also I will change status on commitfest to 'Needs review'.
> >>>
> >>> This patch has seen periodic rebases but no code review that I
> >>> can see since January 2018.
> >>>
> >>> As Andres noted in [1], I think that we need to decide if this
> >>> is a feature that we want rather than just continuing to push it
> >>> from CF to CF.
> >>>
> >>>
> >>> Yes. I took a look at the code of this patch. I think it's in pretty
> >>> good shape. But high-level review/discussion is required.
> >>
> >> OK, but I think this patch can only be pushed one more time,
> maximum,
> >> before it should be rejected.
> >>
> >> Regards,
> >
> > Hi,
> > in my opinion this patch is usually skipped not because it is not
> > needed, but because of its size. It is not hard to maintain it until
> > committers have time for it or I get an actual response that
> > nobody is going to commit it.
> >
>
> That may be one of the reasons, yes. But there are other reasons, which
> I think may be playing a bigger role.
>
> There's one practical issue with how the patch is structured - the docs
> and tests are in separate patches towards the end of the patch series,
> which makes it impossible to commit the preceding parts. This needs to
> change. Otherwise the patch size kills the patch as a whole.
>
> But there's a more important cost/benefit issue, I think. When I look at
> patches as a committer, I naturally have to weigh how much time I spend
> on getting it in (and then dealing with fallout from bugs etc) vs. what
> I get in return (measured in benefits for community, users). This patch
> is pretty large and complex, so the "costs" are quite high, while the
> benefit from the patch itself is the ability to pick between pglz and
> zlib. Which is not great, and so people tend to pick other patches.
>
> Now, I understand there's a lot of potential benefits further down the
> line, like column-level compression (which I think is the main goal
> here). But that's not included in the patch, so the gains are somewhat
> far in the future.
>
>
> Not discussing whether any particular committer should pick this up but
> I want to discuss an important use case we have at Adjust for this sort
> of patch.
>
> The PostgreSQL compression strategy is something we find inadequate for
> at least one of our large deployments (a large debug log spanning
> 10PB+). Our current solution is to set storage so that it does not
> compress and then run on ZFS to get compression speedups on spinning disks.
>
> But running PostgreSQL on ZFS has some annoying costs because we have
> copy-on-write on copy-on-write, and when you add file fragmentation... I
> would really like to be able to get away from having to do ZFS as an
> underlying filesystem. While we have good write throughput, read
> throughput is not as good as I would like.
>
> An approach that would give us better row-level compression would allow
> us to ditch the COW-filesystem-under-PostgreSQL approach.
>
> So I think the benefits are actually quite high particularly for those
> dealing with volume/variety problems where things like JSONB might be a
> go-to solution. Similarly, I could totally see systems which
> handle large amounts of specialized text using extensions for dealing
> with these.
>
Sure, I don't disagree - the proposed compression approach may be a big
win for some deployments further down the road, no doubt about it. But
as I said, it's unclear when we get there (or if the interesting stuff
will be in some sort of extension, which I don't oppose in principle).
>
> But hey, I think there are committers working for postgrespro, who might
> have the motivation to get this over the line. Of course, assuming that
> there are no serious objections to having this functionality or how it's
> implemented ... But I don't think that was the case.
>
>
> While I am not currently able to speak to questions of how it is
> implemented, I can say with very little doubt that we would almost
> certainly use this functionality if it were there, and I can see plenty
> of other cases where this would be a very appropriate direction for some
> other projects as well.
>
Well, I guess the best thing you can do to move this patch forward is to
actually try that on your real-world use case, and report your results
and possibly do a review of the patch.
IIRC there was an extension [1] leveraging this custom compression
interface for better jsonb compression, so perhaps that would work for
you (not sure if it's up to date with the current patch, though).
[1]
https://www.postgresql.org/message-id/20171130182009.1b492eb2%40wp.localdomain
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Chris Travers <chris(dot)travers(at)adjust(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-03-19 15:44:40 |
Message-ID: | CAN-RpxBuBvFFp-Ynq6y6ehwo1VKt-VtHgM0NTmxhQCXUiaqdjQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Mar 19, 2019 at 12:19 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
>
> On 3/19/19 10:59 AM, Chris Travers wrote:
> >
> >
> > Not discussing whether any particular committer should pick this up but
> > I want to discuss an important use case we have at Adjust for this sort
> > of patch.
> >
> > The PostgreSQL compression strategy is something we find inadequate for
> > at least one of our large deployments (a large debug log spanning
> > 10PB+). Our current solution is to set storage so that it does not
> > compress and then run on ZFS to get compression speedups on spinning
> disks.
> >
> > But running PostgreSQL on ZFS has some annoying costs because we have
> > copy-on-write on copy-on-write, and when you add file fragmentation... I
> > would really like to be able to get away from having to do ZFS as an
> > underlying filesystem. While we have good write throughput, read
> > throughput is not as good as I would like.
> >
> > An approach that would give us better row-level compression would allow
> > us to ditch the COW-filesystem-under-PostgreSQL approach.
> >
> > So I think the benefits are actually quite high particularly for those
> > dealing with volume/variety problems where things like JSONB might be a
> > go-to solution. Similarly, I could totally see systems which
> > handle large amounts of specialized text using extensions for dealing
> > with these.
> >
>
> Sure, I don't disagree - the proposed compression approach may be a big
> win for some deployments further down the road, no doubt about it. But
> as I said, it's unclear when we get there (or if the interesting stuff
> will be in some sort of extension, which I don't oppose in principle).
>
I would assume that if extensions are particularly stable and useful they
could be moved into core.
But I would also assume that at first, this area would be sufficiently
experimental that folks (like us) would write our own extensions for it.
>
> >
> > But hey, I think there are committers working for postgrespro, who
> might
> > have the motivation to get this over the line. Of course, assuming
> that
> > there are no serious objections to having this functionality or how
> it's
> > implemented ... But I don't think that was the case.
> >
> >
> > While I am not currently able to speak to questions of how it is
> > implemented, I can say with very little doubt that we would almost
> > certainly use this functionality if it were there, and I can see plenty
> > of other cases where this would be a very appropriate direction for some
> > other projects as well.
> >
> Well, I guess the best thing you can do to move this patch forward is to
> actually try that on your real-world use case, and report your results
> and possibly do a review of the patch.
>
Yeah, I expect to do this within the next month or two.
>
> IIRC there was an extension [1] leveraging this custom compression
> interface for better jsonb compression, so perhaps that would work for
> you (not sure if it's up to date with the current patch, though).
>
> [1]
>
> https://www.postgresql.org/message-id/20171130182009.1b492eb2%40wp.localdomain
>
Yeah, I will be looking at a couple of different approaches here and reporting
back. I don't expect it will be a full production workload, but I do expect
to be able to report on benchmarks in both storage and performance.
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Chris Travers <chris(dot)travers(at)adjust(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-03-21 19:59:56 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/19/19 4:44 PM, Chris Travers wrote:
>
>
> On Tue, Mar 19, 2019 at 12:19 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>
>
> On 3/19/19 10:59 AM, Chris Travers wrote:
> >
> >
> > Not discussing whether any particular committer should pick this
> up but
> > I want to discuss an important use case we have at Adjust for this
> sort
> > of patch.
> >
> > The PostgreSQL compression strategy is something we find
> inadequate for
> > at least one of our large deployments (a large debug log spanning
> > 10PB+). Our current solution is to set storage so that it does not
> > compress and then run on ZFS to get compression speedups on
> spinning disks.
> >
> > But running PostgreSQL on ZFS has some annoying costs because we have
> > copy-on-write on copy-on-write, and when you add file
> fragmentation... I
> > would really like to be able to get away from having to do ZFS as an
> > underlying filesystem. While we have good write throughput, read
> > throughput is not as good as I would like.
> >
> > An approach that would give us better row-level compression would allow
> > us to ditch the COW-filesystem-under-PostgreSQL approach.
> >
> > So I think the benefits are actually quite high particularly for those
> > dealing with volume/variety problems where things like JSONB might be a
> > go-to solution. Similarly, I could totally see systems which
> > handle large amounts of specialized text using extensions for dealing
> > with these.
> >
>
> Sure, I don't disagree - the proposed compression approach may be a big
> win for some deployments further down the road, no doubt about it. But
> as I said, it's unclear when we get there (or if the interesting stuff
> will be in some sort of extension, which I don't oppose in principle).
>
>
> I would assume that if extensions are particularly stable and useful
> they could be moved into core.
>
> But I would also assume that at first, this area would be sufficiently
> experimental that folks (like us) would write our own extensions for it.
>
>
>
> >
> > But hey, I think there are committers working for postgrespro,
> who might
> > have the motivation to get this over the line. Of course,
> assuming that
> > there are no serious objections to having this functionality
> or how it's
> > implemented ... But I don't think that was the case.
> >
> >
> > While I am not currently able to speak to questions of how it is
> > implemented, I can say with very little doubt that we would almost
> > certainly use this functionality if it were there, and I can see plenty
> > of other cases where this would be a very appropriate direction for some
> > other projects as well.
> >
> Well, I guess the best thing you can do to move this patch forward is to
> actually try that on your real-world use case, and report your results
> and possibly do a review of the patch.
>
>
> Yeah, I expect to do this within the next month or two.
>
>
>
> IIRC there was an extension [1] leveraging this custom compression
> interface for better jsonb compression, so perhaps that would work for
> you (not sure if it's up to date with the current patch, though).
>
> [1]
> https://www.postgresql.org/message-id/20171130182009.1b492eb2%40wp.localdomain
>
> Yeah I will be looking at a couple different approaches here and
> reporting back. I don't expect it will be a full production workload but
> I do expect to be able to report on benchmarks in both storage and
> performance.
>
FWIW I was a bit curious how that jsonb compression would affect the
data set I'm using for testing jsonpath patches, so I spent a bit of
time getting it to work with master. The attached patch gets it to
compile, but unfortunately then it fails like this:
ERROR: jsonbd: worker has detached
It seems there's some bug in how shm_mq is used, but I don't have time
to investigate that further.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
jsonbd-master-fix.patch | text/x-patch | 7.6 KB |
From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-07-01 10:21:46 |
Message-ID: | CA+hUKGJ921L3q+sFkEqO2_YRpSpCPqvs=J3Wz41absZn0ayuaQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Mar 16, 2019 at 12:52 AM Ildus Kurbangaliev
<i(dot)kurbangaliev(at)gmail(dot)com> wrote:
> in my opinion this patch is usually skipped not because it is not
> needed, but because of its size. It is not hard to maintain it until
> committers have time for it or I get an actual response that
> nobody is going to commit it.
Hi Ildus,
To maximise the chances of more review in the new Commitfest that is
about to begin, could you please send a fresh rebase? This doesn't
apply anymore.
Thanks,
--
Thomas Munro
https://enterprisedb.com
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-07-01 10:47:56 |
Message-ID: | CAPpHfdu-vBCXWpcUJExzfOXH9qsEiqZSr35uJjvF2+94=moMHw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi, Thomas!
On Mon, Jul 1, 2019 at 1:22 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Sat, Mar 16, 2019 at 12:52 AM Ildus Kurbangaliev
> <i(dot)kurbangaliev(at)gmail(dot)com> wrote:
> > in my opinion this patch is usually skipped not because it is not
> > needed, but because of its size. It is not hard to maintain it until
> > committers have time for it or I get an actual response that
> > nobody is going to commit it.
>
> To maximise the chances of more review in the new Commitfest that is
> about to begin, could you please send a fresh rebase? This doesn't
> apply anymore.
As I understand it, we currently need to make a high-level decision on
whether we need this [1]. I was going to bring this topic up at the last
PGCon, but I didn't manage to attend. Is it worth bothering Ildus with
continuous rebasing, assuming we don't have this high-level decision
yet?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-07-01 14:51:36 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2019-Jul-01, Alexander Korotkov wrote:
> As I understand it, we currently need to make a high-level decision on
> whether we need this [1]. I was going to bring this topic up at the last
> PGCon, but I didn't manage to attend. Is it worth bothering Ildus with
> continuous rebasing, assuming we don't have this high-level decision
> yet?
I agree that having to constantly rebase a patch that doesn't get acted
upon is a bit pointless. I see a bit of a process problem here: if the
patch doesn't apply, it gets punted out of commitfest and reviewers
don't look at it. This means the discussion goes unseen and no
decisions are made. My immediate suggestion is to rebase even if other
changes are needed. Longer-term I think it'd be useful to have patches
marked as needing "high-level decisions" that may lag behind current
master; maybe we have them provide a git commit-ID on top of which the
patch applies cleanly.
I recently found git-imerge which can make rebasing of large patch
series easier, by letting you deal with smaller conflicts one step at a
time rather than one giant conflict; it may prove useful.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-07-01 15:28:12 |
Message-ID: | CAPpHfdtjDWBLNpaubsdByLqWJ0MfSLzGAHwXuP4SW0JRiPo3eQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jul 1, 2019 at 5:51 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> On 2019-Jul-01, Alexander Korotkov wrote:
>
> > As I understand it, we currently need to make a high-level decision on
> > whether we need this [1]. I was going to bring this topic up at the last
> > PGCon, but I didn't manage to attend. Is it worth bothering Ildus with
> > continuous rebasing, assuming we don't have this high-level decision
> > yet?
>
> I agree that having to constantly rebase a patch that doesn't get acted
> upon is a bit pointless. I see a bit of a process problem here: if the
> patch doesn't apply, it gets punted out of commitfest and reviewers
> don't look at it. This means the discussion goes unseen and no
> decisions are made. My immediate suggestion is to rebase even if other
> changes are needed.
OK, let's do this assuming Ildus didn't give up yet :)
> Longer-term I think it'd be useful to have patches
> marked as needing "high-level decisions" that may lag behind current
> master; maybe we have them provide a git commit-ID on top of which the
> patch applies cleanly.
+1,
Sounds like a good approach to me.
> I recently found git-imerge which can make rebasing of large patch
> series easier, by letting you deal with smaller conflicts one step at a
> time rather than one giant conflict; it may prove useful.
Thank you for the pointer, will try.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Ildus K <i(dot)kurbangaliev(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-07-02 13:05:42 |
Message-ID: | CAGRT6+Nnq8ZsU6wk9+Vhw8w7YRQDn67hkhWOvAP48gSYiExSzg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 1 Jul 2019 at 17:28, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
wrote:
> On Mon, Jul 1, 2019 at 5:51 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
> wrote:
> > On 2019-Jul-01, Alexander Korotkov wrote:
> >
> > > As I understand it, we currently need to make a high-level decision on
> > > whether we need this [1]. I was going to bring this topic up at the last
> > > PGCon, but I didn't manage to attend. Is it worth bothering Ildus with
> > > continuous rebasing, assuming we don't have this high-level decision
> > > yet?
> >
> > I agree that having to constantly rebase a patch that doesn't get acted
> > upon is a bit pointless. I see a bit of a process problem here: if the
> > patch doesn't apply, it gets punted out of commitfest and reviewers
> > don't look at it. This means the discussion goes unseen and no
> > decisions are made. My immediate suggestion is to rebase even if other
> > changes are needed.
>
> OK, let's do this assuming Ildus didn't give up yet :)
>
No, I still didn't give up :)
I'm going to post a rebased version in a few days. I found there are new
conflicts with slice decompression; I'm not sure how to resolve them yet.
Also, I was thinking there may be a point in adding the possibility to
compress any data that goes to some column regardless of the TOAST
threshold size. In our company we have types that could benefit from
compression even on the smallest blocks.
Since pluggable storage was committed, I think I should note that
compression methods can also be used by table access methods and are not
supposed to work only with TOAST tables. Basically it's just an interface
to call the compression functions associated with some column.
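To give a rough idea of the shape of that interface, here is a minimal
sketch in standalone C (names and signatures are illustrative only, not
the actual definitions from the patch):

#include <stddef.h>
#include <stdio.h>

struct varlena;                 /* opaque stand-in for the real varlena type */

/* Per-method routine table; a table AM (or the toaster) would look this
 * up via the column's compression options and call into it. */
typedef struct CompressionRoutine
{
    struct varlena *(*compress) (const struct varlena *value, void *options);
    struct varlena *(*decompress) (const struct varlena *value, void *options);
    /* decompress only the first slicelength bytes of the raw datum */
    struct varlena *(*decompress_slice) (const struct varlena *value,
                                         size_t slicelength, void *options);
} CompressionRoutine;

/* dummy method that hands its input back, just to show the wiring */
static struct varlena *
noop(const struct varlena *value, void *options)
{
    (void) options;
    return (struct varlena *) value;
}

int
main(void)
{
    CompressionRoutine cm = {noop, noop, NULL};

    printf("compress routine wired: %d\n", cm.compress != NULL);
    return 0;
}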
Best regards,
Ildus Kurbangaliev
From: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-07-08 12:00:21 |
Message-ID: | CAGRT6+NERrRZXpnTAqr+X6jF7ywRmt7nixD+ivp2g7_rGUaVsg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Attached is the latest version of the patch.
Added a slice decompression function to the compression handler.
Based on: 6b8548964bccd0f2e65c687d591b7345d5146bfa
Best regards,
Ildus Kurbangaliev
On Tue, 2 Jul 2019 at 15:05, Ildus K <i(dot)kurbangaliev(at)gmail(dot)com> wrote:
>
> On Mon, 1 Jul 2019 at 17:28, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>
>> On Mon, Jul 1, 2019 at 5:51 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>> > On 2019-Jul-01, Alexander Korotkov wrote:
>> >
>> > > As I understand it, we currently need to make a high-level decision on
>> > > whether we need this [1]. I was going to bring this topic up at the last
>> > > PGCon, but I didn't manage to attend. Is it worth bothering Ildus with
>> > > continuous rebasing, assuming we don't have this high-level decision
>> > > yet?
>> >
>> > I agree that having to constantly rebase a patch that doesn't get acted
>> > upon is a bit pointless. I see a bit of a process problem here: if the
>> > patch doesn't apply, it gets punted out of commitfest and reviewers
>> > don't look at it. This means the discussion goes unseen and no
>> > decisions are made. My immediate suggestion is to rebase even if other
>> > changes are needed.
>>
>> OK, let's do this assuming Ildus didn't give up yet :)
>
>
> No, I still didn't give up :)
> I'm going to post a rebased version in a few days. I found there are new conflicts with
> slice decompression; I'm not sure how to resolve them yet.
>
> Also, I was thinking there may be a point in adding the possibility to compress any data
> that goes to some column regardless of the TOAST threshold size. In our company we have
> types that could benefit from compression even on the smallest blocks.
>
> Since pluggable storage was committed, I think I should note that compression
> methods can also be used by table access methods and are not supposed to work only with TOAST tables.
> Basically it's just an interface to call the compression functions associated with some column.
>
> Best regards,
> Ildus Kurbangaliev
Attachment | Content-Type | Size |
---|---|---|
custom_compression_methods_v23.patch | text/x-patch | 346.5 KB |
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2019-09-25 20:20:46 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
This one has failed to compile for a long time. Is there a
rebase happening?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-06-19 17:03:02 |
Message-ID: | CA+TgmobSDVgUage9qQ5P_=F_9jaMkCgyKxUQGtFQU7oN4kX-AA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 7, 2019 at 2:51 AM Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> Yes. I took a look at the code of this patch. I think it's in pretty good shape. But high-level review/discussion is required.
I agree that the code of this patch is in pretty good shape, although
there is a lot of rebasing needed at this point. Here is an attempt at
some high-level review and discussion:
- As far as I can see, there is broad agreement that we shouldn't
consider ourselves to be locked into 'pglz' forever. I believe
numerous people have reported that there are other methods of doing
compression that either compress better, or compress faster, or
decompress faster, or all of the above. This isn't surprising, nor
is it a knock on 'pglz'; Jan committed it in 1999, and it's not
surprising that in 20 years some people have come up with better
ideas. Not only that, but the quantity and quality of open source
software that is available for this kind of thing and for many other
kinds of things have improved dramatically in that time.
- I can see three possible ways of breaking our dependence on 'pglz'
for TOAST compression. Option #1 is to pick one new algorithm which we
think is better than 'pglz' in all relevant ways and use it as the
default for all new compressed datums. This would be dramatically
simpler than what this patch does, because there would be no user
interface. It would just be out with the old and in with the new.
Option #2 is to create a short list of new algorithms that have
different trade-offs; e.g. one that is very fast (like lz4) and one
that has an extremely high compression ratio, and provide an interface
for users to choose between them. This would be moderately simpler
than what this patch does, because we would not expose to the user
anything about how a new compression method could be added, but it
would still require a UI for the user to choose between the available
(and hard-coded) options. It has the further advantage that every
PostgreSQL cluster will offer the same options (or a subset of them,
perhaps, depending on configure flags) and so you don't have to worry
that, say, a pg_am row gets lost and suddenly all of your toasted data
is inaccessible and uninterpretable. Option #3 is to do what this
patch actually does, which is to allow for the addition of any number
of compressors, including by extensions. It has the advantage that new
compressors can be added with core's permission, so, for example, if
it is unclear whether some excellent compressor is free of patent
problems, we can elect not to ship support for it in core, while at
the same time people who are willing to accept the associated legal
risk can add that functionality to their own copy as an extension
without having to patch core. The legal climate may even vary by
jurisdiction, so what might be questionable in country A might be
clearly just fine in country B. Aside from those issues, this approach
allows people to experiment and innovate outside of core relatively
quickly, instead of being bound by the somewhat cumbrous development
process which has left this patch in limbo for the last few years. My
view is that option #1 is likely to be impractical, because getting
people to agree is hard, and better things are likely to come along
later, and people like options. So I prefer either #2 or #3.
- The next question is how a datum compressed with some non-default
method should be represented on disk. The patch handles this first of
all by making the observation that the compressed size can't be >=1GB,
because the uncompressed size can't be >=1GB, and we wouldn't have
stored it compressed if it expanded. Therefore, the upper two bits of
the compressed size should always be zero on disk, and the patch
steals one of them to indicate whether "custom" compression is in use.
If it is, the 4-byte varlena header is followed not only by a 4-byte
size (now with the new flag bit also included) but also by a 4-byte
OID, indicating the compression AM in use. I don't think this is a
terrible approach, but I don't think it's amazing, either. 4 bytes is
quite a bit to use for this; if I guess correctly what will be a
typical cluster configuration, you probably would really only need
about 2 bits. For a datum that is both stored externally and
compressed, the overhead is likely negligible, because the length is
probably measured in kB or MB. But for a datum that is compressed but
not stored externally, it seems pretty expensive; the datum is
probably short, and having an extra 4 bytes of uncompressible data
kinda sucks. One possibility would be to allow only one byte here:
require each compression AM that is installed to advertise a one-byte
value that will denote its compressed datums. If more than one AM
tries to claim the same byte value, complain. Another possibility is
to abandon this approach and go with #2 from the previous paragraph.
Or maybe we add 1 or 2 "privileged" built-in compressors that get
dedicated bit-patterns in the upper 2 bits of the size field, with the
last bit pattern being reserved for future algorithms. (e.g. bits 00 =
pglz, 01 = lz4, 10 = zstd, 11 = something else - see within for
details).
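For illustration, here is a standalone toy model of that layout (field
names and the exact bit chosen for the flag are made up for the example,
not taken from the patch):

#include <stdint.h>
#include <stdio.h>

#define RAWSIZE_MASK 0x3FFFFFFFu    /* low 30 bits: size, always < 1GB */
#define CUSTOM_FLAG  0x40000000u    /* one of the two spare upper bits */

/* When CUSTOM_FLAG is set, a 4-byte compression AM OID follows the
 * 4-byte size word, before the compressed payload itself. */
typedef struct
{
    uint32_t    va_info;        /* rawsize plus flag bit */
    uint32_t    cm_oid;         /* only meaningful when CUSTOM_FLAG is set */
} toy_compressed_header;

int
main(void)
{
    toy_compressed_header h = {123456u | CUSTOM_FLAG, 16384u};

    printf("rawsize=%u custom=%d cm_oid=%u\n",
           (unsigned) (h.va_info & RAWSIZE_MASK),
           (h.va_info & CUSTOM_FLAG) != 0,
           (unsigned) h.cm_oid);
    return 0;
}

The struct makes it easy to see where the extra 4 bytes per datum go,
which is what makes the inline-compressed case feel expensive.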
- I don't really like the use of the phrase "custom compression". I
think the terminology needs to be rethought so that we just talk about
compression methods. Perhaps in certain contexts we need to specify
that we mean extensible compression methods or user-provided
compression methods or something like that, but I don't think the word
"custom" is very well-suited here. The main point of this shouldn't be
for every cluster in the universe to use a different approach to
compression, or to compress columns within a database in 47 different
ways, but rather to help us get out from under 'pglz'. Eventually we
probably want to change the default, but as the patch phrases things
now, that default would be a custom method, which is almost a
contradiction in terms.
- Yet another possible approach to the on-disk format is to leave
varatt_external.va_extsize and varattrib_4b.rawsize untouched and
instead add new compression methods by adding new vartag_external
values. There's quite a lot of bit-space available there: we have a
whole byte, and we're currently only using 4 values. We could easily
add a half-dozen new possibilities there for new compression methods
without sweating the bit-space consumption. The main thing I don't
like about this is that it only seems like a useful way to provide for
out-of-line compression. Perhaps it could be generalized to allow for
inline compression as well, but it seems like it would take some
hacking.
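Just to make that concrete, a toy sketch of the dispatch such tags would
enable (tag values here are made up, not the actual postgres.h ones):

#include <stdio.h>

/* Hypothetical extension of the one-byte vartag space: one new tag per
 * out-of-line compression method. Values are illustrative only. */
typedef enum toy_vartag
{
    TOY_VARTAG_ONDISK = 18,     /* stands in for the existing on-disk tag */
    TOY_VARTAG_ONDISK_LZ4 = 19, /* hypothetical addition */
    TOY_VARTAG_ONDISK_ZSTD = 20 /* hypothetical addition */
} toy_vartag;

static const char *
describe(toy_vartag tag)
{
    switch (tag)
    {
        case TOY_VARTAG_ONDISK:
            return "out-of-line, pglz or uncompressed";
        case TOY_VARTAG_ONDISK_LZ4:
            return "out-of-line, lz4";
        case TOY_VARTAG_ONDISK_ZSTD:
            return "out-of-line, zstd";
    }
    return "unknown";
}

int
main(void)
{
    printf("%s\n", describe(TOY_VARTAG_ONDISK_LZ4));
    return 0;
}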
- One thing I really don't like about the patch is that it consumes a
bit from infomask2 for a new flag HEAP_HASCUSTOMCOMPRESSED. infomask
bits are at a premium, and there's been no real progress in the decade
plus that I've been hanging around here in clawing back any bit-space.
I think we really need to avoid burning our remaining bits for
anything other than a really critical need, and I don't think I
understand what the need is in this case. I might be missing
something, but I'd really strongly suggest looking for a way to get
rid of this. It also invents the concept of a TupleDesc flag, and the
flag invented is TD_ATTR_CUSTOM_COMPRESSED; I'm not sure I see why we
need that, either.
- It seems like this kind of approach has a sort of built-in
circularity problem. It means that every place that might need to
detoast a datum needs to be able to access the pg_am catalog. I wonder
if that's actually true. For instance, consider logical decoding. I
guess that can do catalog lookups in general, but can it do them from
the places where detoasting is happening? Moreover, can it do them
with the right snapshot? Suppose we rewrite a table to change the
compression method, then drop the old compression method, then try to
decode a transaction that modified that table before those operations
were performed. As an even more extreme example, suppose we need to
open pg_am, and to do that we have to build a relcache entry for it,
and suppose the relevant pg_class entry had a relacl or reloptions
field that happened to be custom-compressed. Or equally suppose that
any of the various other tables we use when building a relcache entry
had the same kind of problem, especially those that have TOAST tables.
We could just disallow the use of non-default compressors in the
system catalogs, but the benefits mentioned in
http://postgr.es/m/[email protected] seem too large to
ignore.
- I think it would be awfully appealing if we could find some way of
dividing this great big patch into some somewhat smaller patches. For
example:
Patch #1. Add syntax allowing a compression method to be specified,
but the only possible choice is pglz, and the PRESERVE stuff isn't
supported, and changing the value associated with an existing column
isn't supported, but we can add tab-completion support and stuff.
Patch #2. Add a second built-in method, like gzip or lz4.
Patch #3. Add support for changing the compression method associated
with a column, forcing a table rewrite.
Patch #4. Add support for PRESERVE, so that you can change the
compression method associated with a column without forcing a table
rewrite, by including the old method in the PRESERVE list, or with a
rewrite, by not including it in the PRESERVE list.
Patch #5. Add support for compression methods via the AM interface.
Perhaps methods added in this manner are prohibited in system
catalogs. (This could also go before #4 or even before #3, but with a
noticeable hit to usability.)
Patch #6 (new development). Add a contrib module using the facility
added in #5, perhaps with a slightly off-beat compressor like bzip2
that is more of a niche use case.
I think that if the patch set were broken up this way, it would be a
lot easier to review and get committed. I think you could commit each
bit separately. I don't think you'd want to commit #1 unless you had a
sense that #2 was pretty close to done, and similarly for #5 and #6,
but that would still make things a lot easier than having one giant
monolithic patch, at least IMHO.
There might be more to say here, but that's what I have got for now. I
hope it helps.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-06-22 20:53:16 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2020-06-19 13:03:02 -0400, Robert Haas wrote:
> - I can see three possible ways of breaking our dependence on 'pglz'
> for TOAST compression. Option #1 is to pick one new algorithm which we
> think is better than 'pglz' in all relevant ways and use it as the
> default for all new compressed datums. This would be dramatically
> simpler than what this patch does, because there would be no user
> interface. It would just be out with the old and in with the new.
> Option #2 is to create a short list of new algorithms that have
> different trade-offs; e.g. one that is very fast (like lz4) and one
> that has an extremely high compression ratio, and provide an interface
> for users to choose between them. This would be moderately simpler
> than what this patch does, because we would not expose to the user
> anything about how a new compression method could be added, but it
> would still require a UI for the user to choose between the available
> (and hard-coded) options. It has the further advantage that every
> PostgreSQL cluster will offer the same options (or a subset of them,
> perhaps, depending on configure flags) and so you don't have to worry
> that, say, a pg_am row gets lost and suddenly all of your toasted data
> is inaccessible and uninterpretable. Option #3 is to do what this
> patch actually does, which is to allow for the addition of any number
> of compressors, including by extensions. It has the advantage that new
> compressors can be added with core's permission, so, for example, if
> it is unclear whether some excellent compressor is free of patent
> problems, we can elect not to ship support for it in core, while at
> the same time people who are willing to accept the associated legal
> risk can add that functionality to their own copy as an extension
> without having to patch core. The legal climate may even vary by
> jurisdiction, so what might be questionable in country A might be
> clearly just fine in country B. Aside from those issues, this approach
> allows people to experiment and innovate outside of core relatively
> quickly, instead of being bound by the somewhat cumbrous development
> process which has left this patch in limbo for the last few years. My
> view is that option #1 is likely to be impractical, because getting
> people to agree is hard, and better things are likely to come along
> later, and people like options. So I prefer either #2 or #3.
I personally favor going for #2, at least initially. Then we can discuss
the runtime-extensibility of #3 separately.
> - The next question is how a datum compressed with some non-default
> method should be represented on disk. The patch handles this first of
> all by making the observation that the compressed size can't be >=1GB,
> because the uncompressed size can't be >=1GB, and we wouldn't have
> stored it compressed if it expanded. Therefore, the upper two bits of
> the compressed size should always be zero on disk, and the patch
> steals one of them to indicate whether "custom" compression is in use.
> If it is, the 4-byte varlena header is followed not only by a 4-byte
> size (now with the new flag bit also included) but also by a 4-byte
> OID, indicating the compression AM in use. I don't think this is a
> terrible approach, but I don't think it's amazing, either. 4 bytes is
> quite a bit to use for this; if I guess correctly what will be a
> typical cluster configuration, you probably would really only need
> about 2 bits. For a datum that is both stored externally and
> compressed, the overhead is likely negligible, because the length is
> probably measured in kB or MB. But for a datum that is compressed but
> not stored externally, it seems pretty expensive; the datum is
> probably short, and having an extra 4 bytes of uncompressible data
> kinda sucks. One possibility would be to allow only one byte here:
> require each compression AM that is installed to advertise a one-byte
> value that will denote its compressed datums. If more than one AM
> tries to claim the same byte value, complain. Another possibility is
> to abandon this approach and go with #2 from the previous paragraph.
> Or maybe we add 1 or 2 "privileged" built-in compressors that get
> dedicated bit-patterns in the upper 2 bits of the size field, with the
> last bit pattern being reserved for future algorithms. (e.g. bits 00 =
> pglz, 01 = lz4, 10 = zstd, 11 = something else - see within for
> details).
Agreed. I favor an approach roughly like I'd implemented below
https://postgr.es/m/20130605150144.GD28067%40alap2.anarazel.de
I.e. leave the vartag etc as-is, but utilize the fact that pglz
compressed datums start with a 4-byte length header, and that due to
the 1GB limit, the first two bits currently have to be 0. That makes it
possible to indicate 2 compression methods without any space overhead, and
additional compression methods are supported by using an additional byte
(or some variable length encoded larger amount) if both bits are 1.
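A toy decoder, just to make the bit layout concrete (this is a sketch of
the idea as described, not code from any of the patches):

#include <stdint.h>
#include <stdio.h>

/* Top two bits of the 4-byte length word: 00 = pglz (all existing data),
 * 01/10 = the zero-overhead method slots, 11 = the method id continues
 * in the following byte. */
static unsigned
decode_method(uint32_t len_word, const uint8_t *next_byte,
              unsigned *extra_hdr_bytes)
{
    unsigned    bits = len_word >> 30;

    if (bits != 3)
    {
        *extra_hdr_bytes = 0;   /* no space overhead at all */
        return bits;
    }
    *extra_hdr_bytes = 1;       /* one escape byte buys 256 more methods */
    return 3 + *next_byte;
}

int
main(void)
{
    uint8_t     escape = 5;
    unsigned    extra;

    /* existing pglz datum: top bits are 00 */
    printf("method=%u\n", decode_method(123456u, &escape, &extra));
    /* escaped method: both bits set, id taken from the next byte */
    printf("method=%u extra=%u\n",
           decode_method((3u << 30) | 42u, &escape, &extra), extra);
    return 0;
}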
> - Yet another possible approach to the on-disk format is to leave
> varatt_external.va_extsize and varattrib_4b.rawsize untouched and
> instead add new compression methods by adding new vartag_external
> values. There's quite a lot of bit-space available there: we have a
> whole byte, and we're currently only using 4 values. We could easily
> add a half-dozen new possibilities there for new compression methods
> without sweating the bit-space consumption. The main thing I don't
> like about this is that it only seems like a useful way to provide for
> out-of-line compression. Perhaps it could be generalized to allow for
> inline compression as well, but it seems like it would take some
> hacking.
One additional note: Adding additional vartag_external values does incur
some noticeable cost, distributed across lots of places.
> - One thing I really don't like about the patch is that it consumes a
> bit from infomask2 for a new flag HEAP_HASCUSTOMCOMPRESSED. infomask
> bits are at a premium, and there's been no real progress in the decade
> plus that I've been hanging around here in clawing back any bit-space.
> I think we really need to avoid burning our remaining bits for
> anything other than a really critical need, and I don't think I
> understand what the need is in this case. I might be missing
> something, but I'd really strongly suggest looking for a way to get
> rid of this. It also invents the concept of a TupleDesc flag, and the
> flag invented is TD_ATTR_CUSTOM_COMPRESSED; I'm not sure I see why we
> need that, either.
+many
Small note: The current patch adds #include "postgres.h" to a few
headers - it shouldn't do so.
Greetings,
Andres Freund
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-06-23 18:27:47 |
Message-ID: | CA+TgmobA79BpyVzBTcCMB56SZeVwtCRjJxnBWPYvVmumzFfktw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jun 22, 2020 at 4:53 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Or maybe we add 1 or 2 "privileged" built-in compressors that get
> > dedicated bit-patterns in the upper 2 bits of the size field, with the
> > last bit pattern being reserved for future algorithms. (e.g. bits 00 =
> > pglz, 01 = lz4, 10 = zstd, 11 = something else - see within for
> > details).
>
> Agreed. I favor an approach roughly like I'd implemented below
> https://postgr.es/m/20130605150144.GD28067%40alap2.anarazel.de
> I.e. leave the vartag etc as-is, but utilize the fact that pglz
> compressed datums starts with a 4 byte length header, and that due to
> the 1GB limit, the first two bits currently have to be 0. That allows to
> indicate 2 compression methods without any space overhead, and
> additional compression methods are supported by using an additional byte
> (or some variable length encoded larger amount) if both bits are 1.
I think there's essentially no difference between these two ideas,
unless the two bits we're talking about stealing are not the same in
the two cases. Am I missing something?
> One additional note: Adding additional vartag_external values does incur
> some noticeable cost, distributed across lots of places.
OK.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-06-23 20:00:42 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2020-06-23 14:27:47 -0400, Robert Haas wrote:
> On Mon, Jun 22, 2020 at 4:53 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Or maybe we add 1 or 2 "privileged" built-in compressors that get
> > > dedicated bit-patterns in the upper 2 bits of the size field, with the
> > > last bit pattern being reserved for future algorithms. (e.g. bits 00 =
> > > pglz, 01 = lz4, 10 = zstd, 11 = something else - see within for
> > > details).
> >
> > Agreed. I favor an approach roughly like I'd implemented below
> > https://postgr.es/m/20130605150144.GD28067%40alap2.anarazel.de
> > I.e. leave the vartag etc as-is, but utilize the fact that pglz
> > compressed datums starts with a 4 byte length header, and that due to
> > the 1GB limit, the first two bits currently have to be 0. That allows to
> > indicate 2 compression methods without any space overhead, and
> > additional compression methods are supported by using an additional byte
> > (or some variable length encoded larger amount) if both bits are 1.
https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a
thread with more information / patches further along.
> I think there's essentially no difference between these two ideas,
> unless the two bits we're talking about stealing are not the same in
> the two cases. Am I missing something?
I confused this patch with the approach in
https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com
sorry for that. It obviously still differs by having lower space
overhead (by virtue of not having a 4 byte 'va_cmid': no additional
space for two methods, and then 1 byte overhead for 256 more), but
that's not that fundamental a difference.
I do think it's nicer to hide the details of the compression inside
toast-specific code, as the version in the "further along" thread above
did.
The varlena stuff feels so archaic, it's hard to keep it all in my head...
I think I've pondered that elsewhere before (but perhaps just on IM with
you?), but I do think we'll need a better toast pointer format at some
point. It's pretty fundamentally based on having the 1GB limit, which I
don't think we can justify for that much longer.
Using something like https://postgr.es/m/20191210015054.5otdfuftxrqb5gum%40alap3.anarazel.de
I'd probably make it something roughly like:
1) signed varint indicating "in-place" length
1a) if positive, it's "plain" "in-place" data
1b) if negative, data type indicator follows. abs(length) includes size of metadata.
2) optional: unsigned varint metadata type indicator
3) data
Because 1) is the size of the data, toast datums can be skipped with a
relatively small number of instructions during tuple deforming, instead of
needing a fair number of branches, as is the case right now.
So a small in-place uncompressed varlena2 would have an overhead of 1
byte up to 63 bytes, and 2 bytes otherwise (with 8 kb pages at least).
An in-place compressed datum could have an overhead as low as 3 bytes (1
byte length, 1 byte indicator for type of compression, 1 byte raw size),
although I suspect it's rarely going to be useful at such small sizes.
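To make that concrete, here is a toy encoder for such a header, assuming
a zigzag + LEB128-style varint (the exact encoding in the message linked
above may well differ):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* LEB128-style unsigned varint: 7 bits per byte, high bit means "more" */
static size_t
varint_encode(uint64_t v, uint8_t *out)
{
    size_t      n = 0;

    do
    {
        uint8_t     b = v & 0x7F;

        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0);
    } while (v);
    return n;
}

/* zigzag keeps small negative lengths encodable in one byte */
static uint64_t
zigzag(int64_t v)
{
    return ((uint64_t) v << 1) ^ (uint64_t) (v >> 63);
}

/* meta_type < 0 encodes case 1a (plain in-place data); otherwise the
 * length is stored negated (1b) and the metadata type indicator (2)
 * follows it */
static size_t
encode_header(int64_t length, int meta_type, uint8_t *out)
{
    size_t      n;

    if (meta_type < 0)
        return varint_encode(zigzag(length), out);
    n = varint_encode(zigzag(-length), out);
    n += varint_encode((uint64_t) meta_type, out + n);
    return n;
}

int
main(void)
{
    uint8_t     buf[24];

    /* 1 header byte for a small plain datum, as estimated above */
    printf("plain, 63 bytes: %zu header byte(s)\n",
           encode_header(63, -1, buf));
    /* 2 header bytes (plus a raw-size field in the payload) for a small
     * compressed datum */
    printf("compressed, 60 bytes, method 1: %zu header byte(s)\n",
           encode_header(60, 1, buf));
    return 0;
}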
Anyway. I think it's probably reasonable to utilize those two bits
before going to a new toast format. But if somebody were more interested
in working on toastv2 I'd not push back either.
Regards,
Andres
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-06-24 11:59:47 |
Message-ID: | CA+TgmobJaisQF2MQi5EF8wKHHZqRpPjFYa5fYYMMfRJXC4ftSg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jun 23, 2020 at 4:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a
> thread with more information / patches further along.
>
> I confused this patch with the approach in
> https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com
> sorry for that. It obviously still differs by having lower space
> overhead (by virtue of not having a 4 byte 'va_cmid': no additional
> space for two methods, and then 1 byte overhead for 256 more), but
> that's not that fundamental a difference.
Wait a minute. Are we saying there are three (3) dueling patches for
adding an alternate TOAST algorithm? It seems like there is:
This "custom compression methods" thread - vintage 2017 - Original
code by Nikita Glukhov, later work by Ildus Kurbangaliev
The "pluggable compression support" thread - vintage 2013 - Andres Freund
The "plgz performance" thread - vintage 2019 - Petr Jelinek
Anyone want to point to a FOURTH implementation of this feature?
I guess the next thing to do is figure out which one is the best basis
for further work.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-06-28 12:22:38 |
Message-ID: | CAFiTN-v3soZKaYtR2ig43t4haJJx3FZXMd2hDaj3E1mtSjwJPg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Jun 24, 2020 at 5:30 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Tue, Jun 23, 2020 at 4:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a
> > thread with more information / patches further along.
> >
> > I confused this patch with the approach in
> > https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com
> > sorry for that. It obviously still differs in not having that approach's
> > lower space overhead (that approach avoids the 4 byte 'va_cmid' entirely:
> > no additional space for two methods, and then 1 byte of overhead for 256
> > more), but that's not that fundamental a difference.
>
> Wait a minute. Are we saying there are three (3) dueling patches for
> adding an alternate TOAST algorithm? It seems like there is:
>
> This "custom compression methods" thread - vintage 2017 - Original
> code by Nikita Glukhov, later work by Ildus Kurbangaliev
> The "pluggable compression support" thread - vintage 2013 - Andres Freund
> The "plgz performance" thread - vintage 2019 - Petr Jelinek
>
> Anyone want to point to a FOURTH implementation of this feature?
>
> I guess the next thing to do is figure out which one is the best basis
> for further work.
I have gone through these 3 threads and here is a summary of what I
understand from them. Feel free to correct me if I have missed
something.
#1. Custom compression methods: Provides a mechanism to create/drop
compression methods using external libraries, and also provides
a way to set the compression method for columns/types. There are
a few complexities with this approach, which are listed below:
a. We need to maintain the dependencies between the column and the
compression method. And the bigger issue is that even after the
compression method is changed, we need to keep the dependencies on the
older compression methods, as we might have some older tuples that were
compressed with them.
b. Inside the compressed attribute, we need to maintain the
compression method so that we know how to decompress it. For this, we
use 2 bits from the raw_size of the compressed varlena header.
#2. pglz performance: Along with pglz, this patch provides an
additional compression method using lz4. The new compression method
can be enabled/disabled at configure time or using SIGHUP. We use
1 bit from the raw_size of the compressed varlena header to identify
the compression method (pglz or lz4).
#3. pluggable compression: This proposal is to replace the existing
pglz algorithm with snappy or lz4, whichever is better. As per
the performance data[1], it appeared that lz4 is the winner in
most of the cases.
- This also provides an additional patch to plug in any compression method.
- This will also use 2 bits from the raw_size of the compressed
attribute for identifying the compression method.
- Provides an option to select the compression method using a GUC, but
the comments in the patch suggest removing the GUC, so it seems that the
GUC was used only for the POC.
- Honestly, I did not clearly understand from this patch set whether
it proposes to replace the existing compression method with the best
method (and the plugin is just provided for performance testing) or
whether it actually proposes an option to have pluggable compression
methods.
IMHO, we can provide a solution based on #1 and #2, i.e. we can
provide a few of the best compression methods in core, and on top of
that, we can also provide a mechanism to create/drop external
compression methods.
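As a rough illustration of the bit-stealing shared by all three
approaches: because a toasted value is always smaller than 1GB, the top
2 bits of the 32-bit raw-size word are free, so something along these
lines becomes possible (the macro names below are placeholders, not
what any of the patches actually use):

/* Placeholder macros: carry a 2-bit compression-method index in the
 * top bits of the raw-size word, leaving 30 bits for the size. */
#define CM_METHOD_BITS		0xC0000000u
#define CM_RAWSIZE_BITS		0x3FFFFFFFu

#define CM_GET_METHOD(info)		(((info) & CM_METHOD_BITS) >> 30)
#define CM_GET_RAWSIZE(info)	((info) & CM_RAWSIZE_BITS)
#define CM_SET_INFO(rawsize, method) \
	(((uint32) (rawsize) & CM_RAWSIZE_BITS) | ((uint32) (method) << 30))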
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-06-29 16:31:48 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2020-06-24 07:59:47 -0400, Robert Haas wrote:
> On Tue, Jun 23, 2020 at 4:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a
> > thread with more information / patches further along.
> >
> > I confused this patch with the approach in
> > https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com
> > sorry for that. It obviously still differs in not having that approach's
> > lower space overhead (that approach avoids the 4 byte 'va_cmid' entirely:
> > no additional space for two methods, and then 1 byte of overhead for 256
> > more), but that's not that fundamental a difference.
>
> Wait a minute. Are we saying there are three (3) dueling patches for
> adding an alternate TOAST algorithm? It seems like there is:
>
> This "custom compression methods" thread - vintage 2017 - Original
> code by Nikita Glukhov, later work by Ildus Kurbangaliev
> The "pluggable compression support" thread - vintage 2013 - Andres Freund
> The "plgz performance" thread - vintage 2019 - Petr Jelinek
>
> Anyone want to point to a FOURTH implementation of this feature?
To be clear, I don't think the 2003 patch should be considered as being
"in the running".
Greetings,
Andres Freund
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-07-24 15:39:46 |
Message-ID: | CA+TgmoYCoK7jSjMmBgH2JpNOYGui=bSMKVQW0yD519dgwDVLjQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jun 29, 2020 at 12:31 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > This "custom compression methods" thread - vintage 2017 - Original
> > code by Nikita Glukhov, later work by Ildus Kurbangaliev
> > The "pluggable compression support" thread - vintage 2013 - Andres Freund
> > The "plgz performance" thread - vintage 2019 - Petr Jelinek
> >
> > Anyone want to point to a FOURTH implementation of this feature?
>
> To be clear, I don't think the 2003 patch should be considered as being
> "in the running".
I guess you mean 2013, not 2003?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-08-13 11:48:17 |
Message-ID: | CAFiTN-uUpX3ck=K0mLEk-G_kUQY=SNOTeqdaNRR9FMdQrHKebw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Jun 19, 2020 at 10:33 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Mar 7, 2019 at 2:51 AM Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> > Yes. I took a look at code of this patch. I think it's in pretty good shape. But high level review/discussion is required.
>
> I agree that the code of this patch is in pretty good shape, although
> there is a lot of rebasing needed at this point. Here is an attempt at
> some high level review and discussion:
>
> - As far as I can see, there is broad agreement that we shouldn't
> consider ourselves to be locked into 'pglz' forever. I believe
> numerous people have reported that there are other methods of doing
> compression that either compress better, or compress faster, or
> decompress faster, or all of the above. This isn't surprising and nor
> is it a knock on 'pglz'; Jan committed it in 1999, and it's not
> surprising that in 20 years some people have come up with better
> ideas. Not only that, but the quantity and quality of open source
> software that is available for this kind of thing and for many other
> kinds of things have improved dramatically in that time.
>
> - I can see three possible ways of breaking our dependence on 'pglz'
> for TOAST compression. Option #1 is to pick one new algorithm which we
> think is better than 'pglz' in all relevant ways and use it as the
> default for all new compressed datums. This would be dramatically
> simpler than what this patch does, because there would be no user
> interface. It would just be out with the old and in with the new.
> Option #2 is to create a short list of new algorithms that have
> different trade-offs; e.g. one that is very fast (like lz4) and one
> that has an extremely high compression ratio, and provide an interface
> for users to choose between them. This would be moderately simpler
> than what this patch does, because we would expose to the user
> anything about how a new compression method could be added, but it
> would still require a UI for the user to choose between the available
> (and hard-coded) options. It has the further advantage that every
> PostgreSQL cluster will offer the same options (or a subset of them,
> perhaps, depending on configure flags) and so you don't have to worry
> that, say, a pg_am row gets lost and suddenly all of your toasted data
> is inaccessible and uninterpretable. Option #3 is to do what this
> patch actually does, which is to allow for the addition of any number
> of compressors, including by extensions. It has the advantage that new
> compressors can be added with core's permission, so, for example, if
> it is unclear whether some excellent compressor is free of patent
> problems, we can elect not to ship support for it in core, while at
> the same time people who are willing to accept the associated legal
> risk can add that functionality to their own copy as an extension
> without having to patch core. The legal climate may even vary by
> jurisdiction, so what might be questionable in country A might be
> clearly just fine in country B. Aside from those issues, this approach
> allows people to experiment and innovate outside of core relatively
> quickly, instead of being bound by the somewhat cumbrous development
> process which has left this patch in limbo for the last few years. My
> view is that option #1 is likely to be impractical, because getting
> people to agree is hard, and better things are likely to come along
> later, and people like options. So I prefer either #2 or #3.
>
> - The next question is how a datum compressed with some non-default
> method should be represented on disk. The patch handles this first of
> all by making the observation that the compressed size can't be >=1GB,
> because the uncompressed size can't be >=1GB, and we wouldn't have
> stored it compressed if it expanded. Therefore, the upper two bits of
> the compressed size should always be zero on disk, and the patch
> steals one of them to indicate whether "custom" compression is in use.
> If it is, the 4-byte varlena header is followed not only by a 4-byte
> size (now with the new flag bit also included) but also by a 4-byte
> OID, indicating the compression AM in use. I don't think this is a
> terrible approach, but I don't think it's amazing, either. 4 bytes is
> quite a bit to use for this; if I guess correctly what will be a
> typical cluster configuration, you probably would really only need
> about 2 bits. For a datum that is both stored externally and
> compressed, the overhead is likely negligible, because the length is
> probably measured in kB or MB. But for a datum that is compressed but
> not stored externally, it seems pretty expensive; the datum is
> probably short, and having an extra 4 bytes of uncompressible data
> kinda sucks. One possibility would be to allow only one byte here:
> require each compression AM that is installed to advertise a one-byte
> value that will denote its compressed datums. If more than one AM
> tries to claim the same byte value, complain. Another possibility is
> to abandon this approach and go with #2 from the previous paragraph.
> Or maybe we add 1 or 2 "privileged" built-in compressors that get
> dedicated bit-patterns in the upper 2 bits of the size field, with the
> last bit pattern being reserved for future algorithms (e.g. the bit
> patterns 00 = pglz, 01 = lz4, 10 = zstd, 11 = something else - see within for
> details).
>
> - I don't really like the use of the phrase "custom compression". I
> think the terminology needs to be rethought so that we just talk about
> compression methods. Perhaps in certain contexts we need to specify
> that we mean extensible compression methods or user-provided
> compression methods or something like that, but I don't think the word
> "custom" is very well-suited here. The main point of this shouldn't be
> for every cluster in the universe to use a different approach to
> compression, or to compress columns within a database in 47 different
> ways, but rather to help us get out from under 'pglz'. Eventually we
> probably want to change the default, but as the patch phrases things
> now, that default would be a custom method, which is almost a
> contradiction in terms.
>
> - Yet another possible approach to the on-disk format is to leave
> varatt_external.va_extsize and varattrib_4b.rawsize untouched and
> instead add new compression methods by adding new vartag_external
> values. There's quite a lot of bit-space available there: we have a
> whole byte, and we're currently only using 4 values. We could easily
> add a half-dozen new possibilities there for new compression methods
> without sweating the bit-space consumption. The main thing I don't
> like about this is that it only seems like a useful way to provide for
> out-of-line compression. Perhaps it could be generalized to allow for
> inline compression as well, but it seems like it would take some
> hacking.
>
> - One thing I really don't like about the patch is that it consumes a
> bit from infomask2 for a new flag HEAP_HASCUSTOMCOMPRESSED. infomask
> bits are at a premium, and there's been no real progress in the decade
> plus that I've been hanging around here in clawing back any bit-space.
> I think we really need to avoid burning our remaining bits for
> anything other than a really critical need, and I don't think I
> understand what the need is in this case. I might be missing
> something, but I'd really strongly suggest looking for a way to get
> rid of this. It also invents the concept of a TupleDesc flag, and the
> flag invented is TD_ATTR_CUSTOM_COMPRESSED; I'm not sure I see why we
> need that, either.
>
> - It seems like this kind of approach has a sort of built-in
> circularity problem. It means that every place that might need to
> detoast a datum needs to be able to access the pg_am catalog. I wonder
> if that's actually true. For instance, consider logical decoding. I
> guess that can do catalog lookups in general, but can it do them from
> the places where detoasting is happening? Moreover, can it do them
> with the right snapshot? Suppose we rewrite a table to change the
> compression method, then drop the old compression method, then try to
> decode a transaction that modified that table before those operations
> were performed. As an even more extreme example, suppose we need to
> open pg_am, and to do that we have to build a relcache entry for it,
> and suppose the relevant pg_class entry had a relacl or reloptions
> field that happened to be custom-compressed. Or equally suppose that
> any of the various other tables we use when building a relcache entry
> had the same kind of problem, especially those that have TOAST tables.
> We could just disallow the use of non-default compressors in the
> system catalogs, but the benefits mentioned in
> http://postgr.es/m/[email protected] seem too large to
> ignore.
>
> - I think it would be awfully appealing if we could find some way of
> dividing this great big patch into some somewhat smaller patches. For
> example:
>
> Patch #1. Add syntax allowing a compression method to be specified,
> but the only possible choice is pglz, and the PRESERVE stuff isn't
> supported, and changing the value associated with an existing column
> isn't supported, but we can add tab-completion support and stuff.
>
> Patch #2. Add a second built-in method, like gzip or lz4.
>
> Patch #3. Add support for changing the compression method associated
> with a column, forcing a table rewrite.
>
> Patch #4. Add support for PRESERVE, so that you can change the
> compression method associated with a column without forcing a table
> rewrite, by including the old method in the PRESERVE list, or with a
> rewrite, by not including it in the PRESERVE list.
>
> Patch #5. Add support for compression methods via the AM interface.
> Perhaps methods added in this manner are prohibited in system
> catalogs. (This could also go before #4 or even before #3, but with a
> noticeable hit to usability.)
>
> Patch #6 (new development). Add a contrib module using the facility
> added in #5, perhaps with a slightly off-beat compressor like bzip2
> that is more of a niche use case.
>
> I think that if the patch set were broken up this way, it would be a
> lot easier to review and get committed. I think you could commit each
> bit separately. I don't think you'd want to commit #1 unless you had a
> sense that #2 was pretty close to done, and similarly for #5 and #6,
> but that would still make things a lot easier than having one giant
> monolithic patch, at least IMHO.
>
> There might be more to say here, but that's what I have got for now. I
> hope it helps.
I have rebased the patch on the latest head, and it is currently broken into 3 parts.
v1-0001: As suggested by Robert, it provides the syntax support for
setting the compression method for a column while creating a table and
adding columns. However, we don't support changing the compression
method for an existing column. As part of this patch, there is only
one built-in compression method that can be set (pglz). In this, we
have one built-in am (pglz), and the compressed attributes will directly
store the oid of the AM. In this patch, I have removed the
pg_attr_compression as we don't support changing the compression
for an existing column, so we don't need to preserve the old
compressions.
v1-0002: Add another built-in compression method (zlib)
v1-0003: Remaining patch set (nothing is changed except a rebase on the
current head and stabilizing check-world; 0001 and 0002 are pulled
out of this)
Next, I will be working on separating out the remaining patches as per
the suggestion by Robert.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Add-support-for-setting-the-compression-method.patch | application/x-patch | 224.9 KB |
v1-0003-Custom-compression-methods.patch | application/x-patch | 178.3 KB |
v1-0002-Add-support-for-another-built-in-compression-meth.patch | application/x-patch | 9.6 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-08-24 06:12:12 |
Message-ID: | CAFiTN-viudxb0kGwtwsXhSCAaNZ4OUYcP-7PM+tr-rEbgGeHQw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Aug 13, 2020 at 5:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
There were some questions which Robert asked in this mail; please
find my answers inline. Also, I have a few questions regarding further
splitting up this patch.
> On Fri, Jun 19, 2020 at 10:33 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> >
> > - One thing I really don't like about the patch is that it consumes a
> > bit from infomask2 for a new flag HEAP_HASCUSTOMCOMPRESSED. infomask
> > bits are at a premium, and there's been no real progress in the decade
> > plus that I've been hanging around here in clawing back any bit-space.
> > I think we really need to avoid burning our remaining bits for
> > anything other than a really critical need, and I don't think I
> > understand what the need is in this case.
IIUC, the main reason for using this flag is to decide whether we need
any detoasting for this tuple. For example, if we are rewriting the
table because the compression method is changed, then if the
HEAP_HASCUSTOMCOMPRESSED bit is not set in the tuple header and the
tuple length does not exceed TOAST_TUPLE_THRESHOLD (i.e. tup->t_len <=
TOAST_TUPLE_THRESHOLD), we don't need to call the
heap_toast_insert_or_update function for this tuple. Whereas if this
flag is set then we do, because we might need to uncompress and
compress back using a different compression method. The same is the
case with INSERT INTO ... SELECT * FROM.
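Concretely, the rewrite path could then take a shortcut like the
following sketch (HEAP_HASCUSTOMCOMPRESSED is the bit proposed by the
patch; the surrounding variable names are only illustrative):

/* Sketch of the check described above, in the table-rewrite path */
if (!(tup->t_data->t_infomask2 & HEAP_HASCUSTOMCOMPRESSED) &&
	tup->t_len <= TOAST_TUPLE_THRESHOLD)
{
	/* no custom-compressed datums and small enough: copy as-is */
}
else
{
	/* may need to decompress and recompress with the new method */
	heaptup = heap_toast_insert_or_update(relation, tup, NULL, options);
}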
> > I might be missing
> > something, but I'd really strongly suggest looking for a way to get
> > rid of this. It also invents the concept of a TupleDesc flag, and the
> > flag invented is TD_ATTR_CUSTOM_COMPRESSED; I'm not sure I see why we
> > need that, either.
This is also used in a similar way to the above, but for the target
table, i.e. if the target table has a custom compressed attribute then
maybe we cannot directly insert the tuple, because the source tuple
might have data compressed using the default compression method.
> > - It seems like this kind of approach has a sort of built-in
> > circularity problem. It means that every place that might need to
> > detoast a datum needs to be able to access the pg_am catalog. I wonder
> > if that's actually true. For instance, consider logical decoding. I
> > guess that can do catalog lookups in general, but can it do them from
> > the places where detoasting is happening? Moreover, can it do them
> > with the right snapshot? Suppose we rewrite a table to change the
> > compression method, then drop the old compression method, then try to
> > decode a transaction that modified that table before those operations
> > were performed. As an even more extreme example, suppose we need to
> > open pg_am, and to do that we have to build a relcache entry for it,
> > and suppose the relevant pg_class entry had a relacl or reloptions
> > field that happened to be custom-compressed. Or equally suppose that
> > any of the various other tables we use when building a relcache entry
> > had the same kind of problem, especially those that have TOAST tables.
> > We could just disallow the use of non-default compressors in the
> > system catalogs, but the benefits mentioned in
> > http://postgr.es/m/[email protected] seem too large to
> > ignore.
> >
> > - I think it would be awfully appealing if we could find some way of
> > dividing this great big patch into some somewhat smaller patches. For
> > example:
> >
> > Patch #1. Add syntax allowing a compression method to be specified,
> > but the only possible choice is pglz, and the PRESERVE stuff isn't
> > supported, and changing the value associated with an existing column
> > isn't supported, but we can add tab-completion support and stuff.
> >
> > Patch #2. Add a second built-in method, like gzip or lz4.
I have already extracted these 2 patches from the main patch set.
But, in these patches, I am still storing the am_oid in the toast
header. I am not sure whether we can get rid of that, at least for
these 2 patches. Without it, wherever we try to uncompress the tuple
we need to know the tuple descriptor to get the am_oid, but I think
that is not possible in all the cases. Am I missing something here?
> > Patch #3. Add support for changing the compression method associated
> > with a column, forcing a table rewrite.
> >
> > Patch #4. Add support for PRESERVE, so that you can change the
> > compression method associated with a column without forcing a table
> > rewrite, by including the old method in the PRESERVE list, or with a
> > rewrite, by not including it in the PRESERVE list.
Does it make sense to have Patch #3 and Patch #4 without having
Patch #5? I mean, why do we need to support rewrite or preserve unless
we have custom compression methods, right? Because the built-in
compression methods cannot be dropped, so why do we need to preserve?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-08-25 17:49:51 |
Message-ID: | CA+TgmoYYeMv=eJqf2JF8VWDPe=S6rYfH78DOxzE-Sa2kQHhiPw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Aug 24, 2020 at 2:12 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> IIUC, the main reason for using this flag is for taking the decision
> whether we need any detoasting for this tuple. For example, if we are
> rewriting the table because the compression method is changed then if
> HEAP_HASCUSTOMCOMPRESSED bit is not set in the tuple header and the
> tuple length does not exceed TOAST_TUPLE_THRESHOLD then we don't need to
> call heap_toast_insert_or_update function for this tuple. Whereas if
> this flag is set then we need to because we might need to uncompress
> and compress back using a different compression method. The same is
> the case with INSERT INTO ... SELECT * FROM.
This doesn't really seem worth it to me. I don't see how we can
justify burning an on-disk bit just to save a little bit of overhead
during a rare maintenance operation. If there's a performance problem
here we need to look for another way of mitigating it. Slowing CLUSTER
and/or VACUUM FULL down by a large amount for this feature would be
unacceptable, but is that really a problem? And if so, can we solve it
without requiring this bit?
> > > something, but I'd really strongly suggest looking for a way to get
> > > rid of this. It also invents the concept of a TupleDesc flag, and the
> > > flag invented is TD_ATTR_CUSTOM_COMPRESSED; I'm not sure I see why we
> > > need that, either.
>
> This is also used in a similar way as the above but for the target
> table, i.e. if the target table has the custom compressed attribute
> then maybe we can not directly insert the tuple because it might have
> compressed data which are compressed using the default compression
> methods.
I think this is just an in-memory flag, which is much less precious
than an on-disk bit. However, I still wonder whether it's really the
right design. I think that if we offer lz4 we may well want to make it
the default eventually, or perhaps even right away. If that ends up
causing this flag to get set on every tuple all the time, then it
won't really optimize anything.
> I have already extracted these 2 patches from the main patch set.
> But, in these patches, I am still storing the am_oid in the toast
> > header. I am not sure whether we can get rid of that at least for these 2
> patches? But, then wherever we try to uncompress the tuple we need to
> know the tuple descriptor to get the am_oid but I think that is not
> possible in all the cases. Am I missing something here?
I think we should instead use the high bits of the toast size word for
patches #1-#4, as discussed upthread.
> > > Patch #3. Add support for changing the compression method associated
> > > with a column, forcing a table rewrite.
> > >
> > > Patch #4. Add support for PRESERVE, so that you can change the
> > > compression method associated with a column without forcing a table
> > > rewrite, by including the old method in the PRESERVE list, or with a
> > > rewrite, by not including it in the PRESERVE list.
>
> Does this make sense to have Patch #3 and Patch #4, without having
> Patch #5? I mean why do we need to support rewrite or preserve unless
> we have custom compression methods, right? Because the built-in
> compression methods cannot be dropped, so why do we need to preserve?
I think that patch #3 makes sense because somebody might have a table
that is currently compressed with pglz and they want to switch to lz4,
and I think patch #4 also makes sense because they might want to start
using lz4 for future data but not force a rewrite to get rid of all
the pglz data they've already got. Those options are valuable as soon
as there is more than one possible compression algorithm, even if
they're all built in. Now, as I said upthread, it's also true that you
could do #5 before #3 and #4. I don't think that's insane. But I
prefer it in the other order, because I think having #5 without #3 and
#4 wouldn't be too much fun for users.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-08-31 05:14:25 |
Message-ID: | CAJ3gD9c370yZFgHRAv+Up987o=t94sr22R36T=BKX5=_fw=TzA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, 13 Aug 2020 at 17:18, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I have rebased the patch on the latest head and currently, broken into 3 parts.
>
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns. However, we don't support changing the compression
> method for the existing column. As part of this patch, there is only
> one built-in compression method that can be set (pglz). In this, we
> have one built-in am (pglz) and the compressed attributes will directly
> store the oid of the AM. In this patch, I have removed the
> pg_attr_compression as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.
> v1-0002: Add another built-in compression method (zlib)
> v1-0003: Remaining patch set (nothing is changed except rebase on the
> current head, stabilizing check-world and 0001 and 0002 are pulled
> out of this)
>
> Next, I will be working on separating out the remaining patches as per
> the suggestion by Robert.
Thanks for this new feature. Looks promising and very useful, with so
many good compression libraries already available.
I see that with the patch-set, I would be able to create an extension
that defines a PostgreSQL C handler function which assigns all the
required hook function implementations for compressing, decompressing
and validating, etc. In short, I would be able to use a completely
different compression algorithm to compress toast data if I write such
an extension. Correct me if I am wrong with my interpretation.
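Schematically, I imagine such an extension would provide something
like the following (the struct and function names here are only
illustrative, not the patch's exact API):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/pg_list.h"

PG_MODULE_MAGIC;

/* Illustrative handler struct; the patch's real API may differ. */
typedef struct CompressionRoutine
{
	void		(*cmcheck) (List *options);		/* validate options */
	struct varlena *(*cmcompress) (const struct varlena *value);
	struct varlena *(*cmdecompress) (const struct varlena *value);
} CompressionRoutine;

static void
my_cmcheck(List *options)
{
	/* validate options such as 'level' here (stub) */
}

static struct varlena *
my_cmcompress(const struct varlena *value)
{
	return NULL;	/* call into the external library here (stub) */
}

static struct varlena *
my_cmdecompress(const struct varlena *value)
{
	return NULL;	/* inverse of the above (stub) */
}

PG_FUNCTION_INFO_V1(my_compression_handler);

Datum
my_compression_handler(PG_FUNCTION_ARGS)
{
	CompressionRoutine *routine = palloc0(sizeof(CompressionRoutine));

	routine->cmcheck = my_cmcheck;
	routine->cmcompress = my_cmcompress;
	routine->cmdecompress = my_cmdecompress;

	PG_RETURN_POINTER(routine);
}

presumably wired up with something like CREATE ACCESS METHOD my_method
TYPE COMPRESSION HANDLER my_compression_handler.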
Just a quick superficial set of review comments ....
A minor re-base is required due to a conflict in a regression test
-------------
In heap_toast_insert_or_update() and in other places, the comments for
new parameter preserved_am_info are missing.
-------------
+toast_compress_datum(Datum value, Oid acoid)
{
struct varlena *tmp = NULL;
int32 valsize;
- CompressionAmOptions cmoptions;
+ CompressionAmOptions *cmoptions = NULL;
I think tmp and cmoptions need not be initialized to NULL
-------------
- TOAST_COMPRESS_SET_RAWSIZE(tmp, valsize);
- SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ);
/* successful compression */
+ toast_set_compressed_datum_info(tmp, amoid, valsize);
return PointerGetDatum(tmp);
Any particular reason why this code is put in a new extern function?
Is there a plan to re-use it? Otherwise, it's not necessary to do
this.
------------
Also, not sure why "HTAB *amoptions_cache" and "MemoryContext
amoptions_cache_mcxt" aren't static declarations. They are being used
only in toast_internals.c
-----------
The tab-completion doesn't show COMPRESSION:
postgres=# create access method my_method TYPE
INDEX TABLE
postgres=# create access method my_method TYPE
Also, the below syntax would better be tab-completed so as to
display all the installed compression methods, in line with how we
show all the storage methods like plain, extended, etc.:
postgres=# ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION
------------
I could see the differences in compression ratio, and the compression
and decompression speed, when I use pglz versus zlib:
CREATE TABLE zlibtab(t TEXT COMPRESSION zlib WITH (level '4'));
create table lztab(t text);
ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION pglz;
pgg:s2:pg$ time psql -c "\copy zlibtab from text.data"
COPY 13050
real 0m1.344s
user 0m0.031s
sys 0m0.026s
pgg:s2:pg$ time psql -c "\copy lztab from text.data"
COPY 13050
real 0m2.088s
user 0m0.008s
sys 0m0.050s
pgg:s2:pg$ time psql -c "select pg_table_size('zlibtab'::regclass),
pg_table_size('lztab'::regclass)"
pg_table_size | pg_table_size
---------------+---------------
1261568 | 1687552
pgg:s2:pg$ time psql -c "select NULL from zlibtab where t like '0000'"
> /dev/null
real 0m0.127s
user 0m0.000s
sys 0m0.002s
pgg:s2:pg$ time psql -c "select NULL from lztab where t like '0000'"
> /dev/null
real 0m0.050s
user 0m0.002s
sys 0m0.000s
--
Thanks,
-Amit Khandekar
Huawei Technologies
From: | Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-09-01 23:27:22 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On Aug 13, 2020, at 4:48 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns. However, we don't support changing the compression
> method for the existing column. As part of this patch, there is only
> one built-in compression method that can be set (pglz). In this, we
> have one built-in am (pglz) and the compressed attributes will directly
> store the oid of the AM. In this patch, I have removed the
> pg_attr_compression as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.
I do not like the way pglz compression is handled in this patch. After upgrading PostgreSQL to the first version with this patch included, pre-existing on-disk compressed data will not include any custom compression Oid in the header, and toast_decompress_datum will notice that and decompress the data directly using pglz_decompress.

If the same data were then written back out, perhaps to another table, into a column with no custom compression method defined, it will get compressed by toast_compress_datum using DefaultCompressionOid, which is defined as PGLZ_COMPRESSION_AM_OID. That isn't a proper round-trip for the data, as when it gets re-compressed, the PGLZ_COMPRESSION_AM_OID gets written into the header, which makes the data a bit longer, but also means that it is not byte-for-byte the same as it was, which is counter-intuitive.

Given that any given pglz compressed datum now has two totally different formats that might occur on disk, code may have to consider both of them, which increases code complexity, and regression tests will need to be written with coverage for both of them, which increases test complexity. It's also not easy to write the extra tests, as there isn't any way (that I see) to intentionally write out the traditional shorter form from a newer database server; you'd have to do something like a pg_upgrade test where you install an older server to write the older format, upgrade, and then check that the new server can handle it.
The cleanest solution to this would seem to be removal of the compression am's Oid from the header for all compression ams, so that pre-patch written data and post-patch written data look exactly the same.

The other solution is to give pglz pride-of-place as the original compression mechanism, and just say that when pglz is the compression method, no Oid gets written to the header, and only when other compression methods are used does the Oid get written. This second option seems closer to the implementation that you already have, because you already handle the decompression of data where the Oid is lacking, so all you have to do is intentionally not write the Oid when compressing using pglz.
Or did I misunderstand your implementation?
--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-09-02 10:33:06 |
Message-ID: | CAFiTN-uUUwJnNq1gi5UDj-LAQPOQV0PsJ_=30QLn6HFUK98RZg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Sep 2, 2020 at 4:57 AM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
>
>
> > On Aug 13, 2020, at 4:48 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > v1-0001: As suggested by Robert, it provides the syntax support for
> > setting the compression method for a column while creating a table and
> > adding columns. However, we don't support changing the compression
> > method for the existing column. As part of this patch, there is only
> > one built-in compression method that can be set (pglz). In this, we
> > have one built-in am (pglz) and the compressed attributes will directly
> > store the oid of the AM. In this patch, I have removed the
> > pg_attr_compression as we don't support changing the compression
> > for the existing column so we don't need to preserve the old
> > compressions.
>
> I do not like the way pglz compression is handled in this patch. After upgrading PostgreSQL to the first version with this patch included, pre-existing on-disk compressed data will not include any custom compression Oid in the header, and toast_decompress_datum will notice that and decompress the data directly using pglz_decompress.
>
> If the same data were then written back out, perhaps to another table, into a column with no custom compression method defined, it will get compressed by toast_compress_datum using DefaultCompressionOid, which is defined as PGLZ_COMPRESSION_AM_OID. That isn't a proper round-trip for the data, as when it gets re-compressed, the PGLZ_COMPRESSION_AM_OID gets written into the header, which makes the data a bit longer, but also means that it is not byte-for-byte the same as it was, which is counter-intuitive.
>
> Given that any given pglz compressed datum now has two totally different formats that might occur on disk, code may have to consider both of them, which increases code complexity, and regression tests will need to be written with coverage for both of them, which increases test complexity. It's also not easy to write the extra tests, as there isn't any way (that I see) to intentionally write out the traditional shorter form from a newer database server; you'd have to do something like a pg_upgrade test where you install an older server to write the older format, upgrade, and then check that the new server can handle it.
>
> The cleanest solution to this would seem to be removal of the compression am's Oid from the header for all compression ams, so that pre-patch written data and post-patch written data look exactly the same.
>
> The other solution is to give pglz pride-of-place as the original compression mechanism, and just say that when pglz is the compression method, no Oid gets written to the header, and only when other compression methods are used does the Oid get written. This second option seems closer to the implementation that you already have, because you already handle the decompression of data where the Oid is lacking, so all you have to do is intentionally not write the Oid when compressing using pglz.
>
> Or did I misunderstand your implementation?
Thanks for looking into it. Actually, I am planning to change this
patch such that we will use the upper 2 bits of the size field instead
of storing the amoid for the built-in compression methods,
e.g. (in binary) 00 = pglz, 01 = zlib, 10 = other built-in, 11 = custom
compression method. Only when 11 is set will we also store the
amoid in the toast header. I think after a week or two I will make
these changes and post my updated patch.
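To spell that out, the decode side might look roughly like the sketch
below (VARCOMPRESS_4B_C, ZLIB_COMPRESSION_AM_OID and the layout of the
custom case are placeholders; PGLZ_COMPRESSION_AM_OID is the constant
Mark mentioned above):

/* Planned 2-bit scheme, values in binary:
 *   00 = pglz, 01 = zlib, 10 = other built-in, 11 = custom.
 * Only for 11 does an amoid follow the size word in the header. */
typedef enum CompressionId
{
	PGLZ_COMPRESSION_ID = 0,
	ZLIB_COMPRESSION_ID = 1,
	RESERVED_COMPRESSION_ID = 2,
	CUSTOM_COMPRESSION_ID = 3
} CompressionId;

static Oid
toast_get_compression_amoid(const struct varlena *ptr)
{
	switch (VARCOMPRESS_4B_C(ptr))	/* placeholder: read the top 2 bits */
	{
		case PGLZ_COMPRESSION_ID:
			return PGLZ_COMPRESSION_AM_OID;
		case ZLIB_COMPRESSION_ID:
			return ZLIB_COMPRESSION_AM_OID;
		case CUSTOM_COMPRESSION_ID:
			/* amoid stored explicitly, right after the size word */
			return *((Oid *) VARDATA(ptr));		/* placeholder layout */
		default:
			elog(ERROR, "unrecognized compression id");
			return InvalidOid;					/* keep compiler quiet */
	}
}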
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-09-02 10:39:14 |
Message-ID: | CAFiTN-tkGOtVSJD0FDZCuVrAYCCURWwRtBxT_4KVduJu-9fueQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Aug 31, 2020 at 10:45 AM Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>
> On Thu, 13 Aug 2020 at 17:18, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > I have rebased the patch on the latest head and currently, broken into 3 parts.
> >
> > v1-0001: As suggested by Robert, it provides the syntax support for
> > setting the compression method for a column while creating a table and
> > adding columns. However, we don't support changing the compression
> > method for the existing column. As part of this patch, there is only
> > one built-in compression method that can be set (pglz). In this, we
> > have one built-in am (pglz) and the compressed attributes will directly
> > store the oid of the AM. In this patch, I have removed the
> > pg_attr_compression as we don't support changing the compression
> > for the existing column so we don't need to preserve the old
> > compressions.
> > v1-0002: Add another built-in compression method (zlib)
> > v1-0003: Remaining patch set (nothing is changed except rebase on the
> > current head, stabilizing check-world and 0001 and 0002 are pulled
> > out of this)
> >
> > Next, I will be working on separating out the remaining patches as per
> > the suggestion by Robert.
>
> Thanks for this new feature. Looks promising and very useful, with so
> many good compression libraries already available.
Thanks for looking into it.
> I see that with the patch-set, I would be able to create an extension
> that defines a PostgreSQL C handler function which assigns all the
> required hook function implementations for compressing, decompressing
> and validating, etc. In short, I would be able to use a completely
> different compression algorithm to compress toast data if I write such
> an extension. Correct me if I am wrong with my interpretation.
>
> Just a quick superficial set of review comments ....
>
> A minor re-base is required due to a conflict in a regression test
Okay, I will do this.
> -------------
>
> In heap_toast_insert_or_update() and in other places, the comments for
> new parameter preserved_am_info are missing.
>
> -------------
ok
> +toast_compress_datum(Datum value, Oid acoid)
> {
> struct varlena *tmp = NULL;
> int32 valsize;
> - CompressionAmOptions cmoptions;
> + CompressionAmOptions *cmoptions = NULL;
>
> I think tmp and cmoptions need not be initialized to NULL
Right
> -------------
>
> - TOAST_COMPRESS_SET_RAWSIZE(tmp, valsize);
> - SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ);
> /* successful compression */
> + toast_set_compressed_datum_info(tmp, amoid, valsize);
> return PointerGetDatum(tmp);
>
> Any particular reason why is this code put in a new extern function ?
> Is there a plan to re-use it ? Otherwise, it's not necessary to do
> this.
>
> ------------
>
> Also, not sure why "HTAB *amoptions_cache" and "MemoryContext
> amoptions_cache_mcxt" aren't static declarations. They are being used
> only in toast_internals.c
> -----------
>
> The tab-completion doesn't show COMPRESSION :
> postgres=# create access method my_method TYPE
> INDEX TABLE
> postgres=# create access method my_method TYPE
>
> Also, the below syntax also would better be tab-completed so as to
> display all the installed compression methods, in line with how we
> show all the storage methods like plain,extended,etc:
> postgres=# ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION
>
> ------------
I will fix these comments in the next version of the patch.
> I could see the differences in compression ratio, and the compression
> and decompression speed, when I use pglz versus zlib:
>
> CREATE TABLE zlibtab(t TEXT COMPRESSION zlib WITH (level '4'));
> create table lztab(t text);
> ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION pglz;
>
> pgg:s2:pg$ time psql -c "\copy zlibtab from text.data"
> COPY 13050
>
> real 0m1.344s
> user 0m0.031s
> sys 0m0.026s
>
> pgg:s2:pg$ time psql -c "\copy lztab from text.data"
> COPY 13050
>
> real 0m2.088s
> user 0m0.008s
> sys 0m0.050s
>
>
> pgg:s2:pg$ time psql -c "select pg_table_size('zlibtab'::regclass),
> pg_table_size('lztab'::regclass)"
> pg_table_size | pg_table_size
> ---------------+---------------
> 1261568 | 1687552
>
> pgg:s2:pg$ time psql -c "select NULL from zlibtab where t like '0000'"
> > /dev/null
>
> real 0m0.127s
> user 0m0.000s
> sys 0m0.002s
>
> pgg:s2:pg$ time psql -c "select NULL from lztab where t like '0000'"
> > /dev/null
>
> real 0m0.050s
> user 0m0.002s
> sys 0m0.000s
>
Thanks for testing this.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-09-19 07:49:48 |
Message-ID: | CAFiTN-vcbfy5ScKVUp16c1N_wzP0RL6EkPBAg_Jm3eDK0ftO5Q@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Aug 25, 2020 at 11:20 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Aug 24, 2020 at 2:12 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > IIUC, the main reason for using this flag is for taking the decision
> > whether we need any detoasting for this tuple. For example, if we are
> > rewriting the table because the compression method is changed then if
> > HEAP_HASCUSTOMCOMPRESSED bit is not set in the tuple header and the
> > tuple length does not exceed TOAST_TUPLE_THRESHOLD then we don't need to
> > call heap_toast_insert_or_update function for this tuple. Whereas if
> > this flag is set then we need to because we might need to uncompress
> > and compress back using a different compression method. The same is
> > the case with INSERT INTO ... SELECT * FROM.
>
> This doesn't really seem worth it to me. I don't see how we can
> justify burning an on-disk bit just to save a little bit of overhead
> during a rare maintenance operation. If there's a performance problem
> here we need to look for another way of mitigating it. Slowing CLUSTER
> and/or VACUUM FULL down by a large amount for this feature would be
> unacceptable, but is that really a problem? And if so, can we solve it
> without requiring this bit?
Okay, if we want to avoid keeping the bit then there are multiple ways
to handle this; the only catch is that none of them will be specific to
those scenarios.
approach1: In ExecModifyTable, we can process the source tuple, and
if any of the varlena attributes is compressed and its stored
compression method is not the same as that of the target table's
attribute, then we can decompress it.
approach2: In heap_prepare_insert, always call
heap_toast_insert_or_update; therein we can check whether any of the
source tuple attributes are compressed with a different compression
method than the target table's, and if so we can decompress them.
With either approach, we have to do this in a generic path because
the source of the tuple is not known; it can be the output of a
function, a join tuple, or a subquery. So in the attached patch, I
have implemented approach1.
For testing, I implemented it using approach1 as well as approach2,
and I checked pgbench performance to see whether it impacts the
generic paths, but I did not see any impact.
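A sketch of what approach1 amounts to, before the slot is routed to
the target table (stored_compression_id and the use of attcompression
here are invented for illustration):

#include "postgres.h"
#include "access/detoast.h"
#include "executor/tuptable.h"

/* invented: returns the compression method recorded in the datum */
extern char stored_compression_id(Datum value);

static void
decompress_mismatched_attrs(TupleTableSlot *slot, TupleDesc tgt_desc)
{
	int			attnum;

	slot_getallattrs(slot);
	for (attnum = 0; attnum < tgt_desc->natts; attnum++)
	{
		Form_pg_attribute att = TupleDescAttr(tgt_desc, attnum);
		Datum		val = slot->tts_values[attnum];

		/* only non-null varlena attributes can be compressed */
		if (slot->tts_isnull[attnum] || att->attlen != -1)
			continue;

		if (VARATT_IS_COMPRESSED(DatumGetPointer(val)) &&
			stored_compression_id(val) != att->attcompression)
			slot->tts_values[attnum] = PointerGetDatum(
				detoast_attr((struct varlena *) DatumGetPointer(val)));
	}
}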
>
> > I have already extracted these 2 patches from the main patch set.
> > But, in these patches, I am still storing the am_oid in the toast
> > header. I am not sure can we get rid of that at least for these 2
> > patches? But, then wherever we try to uncompress the tuple we need to
> > know the tuple descriptor to get the am_oid but I think that is not
> > possible in all the cases. Am I missing something here?
>
> I think we should instead use the high bits of the toast size word for
> patches #1-#4, as discussed upthread.
>
> > > > Patch #3. Add support for changing the compression method associated
> > > > with a column, forcing a table rewrite.
> > > >
> > > > Patch #4. Add support for PRESERVE, so that you can change the
> > > > compression method associated with a column without forcing a table
> > > > rewrite, by including the old method in the PRESERVE list, or with a
> > > > rewrite, by not including it in the PRESERVE list.
> >
> > Does this make sense to have Patch #3 and Patch #4, without having
> > Patch #5? I mean why do we need to support rewrite or preserve unless
> > we have custom compression methods, right? Because the built-in
> > compression methods cannot be dropped, so why do we need to preserve?
>
> I think that patch #3 makes sense because somebody might have a table
> that is currently compressed with pglz and they want to switch to lz4,
> and I think patch #4 also makes sense because they might want to start
> using lz4 for future data but not force a rewrite to get rid of all
> the pglz data they've already got. Those options are valuable as soon
> as there is more than one possible compression algorithm, even if
> they're all built in. Now, as I said upthread, it's also true that you
> could do #5 before #3 and #4. I don't think that's insane. But I
> prefer it in the other order, because I think having #5 without #3 and
> #4 wouldn't be too much fun for users.
Details of the attached patch set
0001: This provides syntax to set the compression method from the
built-in compression methods (pglz or zlib). pg_attribute stores the
compression method (a char), and there are conversion functions to
convert that compression method to the built-in compression array
index. As discussed upthread, the first 2 bits will store the
compression method index, using which we can directly get the handler
routine from the built-in compression method array.
0002: This patch provides an option to change the compression method
for an existing column, and it will rewrite the table.
Next, I will be working on providing an option to alter the
compression method without rewriting the whole table; basically, we
can provide a PRESERVE list to preserve the old compression methods.
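For illustration, the char-to-index mapping could be as simple as this
(all names below are invented; the patch may spell them differently):

#include "postgres.h"

typedef struct CompressionAmRoutine CompressionAmRoutine;	/* invented */
extern const CompressionAmRoutine pglz_compress_methods;	/* invented */
extern const CompressionAmRoutine zlib_compress_methods;	/* invented */

/* index into this array = the 2-bit value stored in the datum */
static const CompressionAmRoutine *builtin_compression_am[] = {
	&pglz_compress_methods,		/* attcompression = 'p' -> index 0 */
	&zlib_compress_methods		/* attcompression = 'z' -> index 1 */
};

static inline int
CompressionMethodIndex(char attcompression)
{
	switch (attcompression)
	{
		case 'p':
			return 0;
		case 'z':
			return 1;
		default:
			elog(ERROR, "invalid compression method: %c", attcompression);
			return -1;			/* keep compiler quiet */
	}
}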
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v2-0002-alter-table-set-compression.patch | application/octet-stream | 9.6 KB |
v2-0001-Built-in-compression-method.patch | application/octet-stream | 175.0 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-09-28 10:48:07 |
Message-ID: | CAFiTN-t+0y5xnPx+sSvverfXPk9E6OMUkdfgMg7mcTTbs8rokQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Sep 19, 2020 at 1:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Aug 25, 2020 at 11:20 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > On Mon, Aug 24, 2020 at 2:12 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > IIUC, the main reason for using this flag is for deciding
> > > whether we need any detoasting for this tuple. For example, if we are
> > > rewriting the table because the compression method has changed, then if
> > > the HEAP_HASCUSTOMCOMPRESSED bit is not set in the tuple header and the
> > > tuple length (tup->t_len) is not > TOAST_TUPLE_THRESHOLD, then we don't
> > > need to call the heap_toast_insert_or_update function for this tuple.
> > > Whereas if this flag is set then we need to, because we might need to
> > > uncompress and compress back using a different compression method. The
> > > same is the case with INSERT INTO ... SELECT * FROM.
> >
> > This doesn't really seem worth it to me. I don't see how we can
> > justify burning an on-disk bit just to save a little bit of overhead
> > during a rare maintenance operation. If there's a performance problem
> > here we need to look for another way of mitigating it. Slowing CLUSTER
> > and/or VACUUM FULL down by a large amount for this feature would be
> > unacceptable, but is that really a problem? And if so, can we solve it
> > without requiring this bit?
>
> Okay, if we want to avoid keeping the bit then there are multiple ways
> to handle this, but the only thing is that none of them will be specific
> to those scenarios.
> approach1. In ExecModifyTable, we can process the source tuple and see
> whether any of the varlena attributes is compressed and its stored
> compression method is not the same as that of the target table
> attribute; then we can decompress it.
> approach2. In heap_prepare_insert, always call
> heap_toast_insert_or_update; therein we can check whether any of the
> source tuple attributes are compressed with a different compression
> method than the target table's; then we can decompress it.
>
> With either approach, we have to do this in a generic path
> because the source of the tuple is not known; I mean, it can be an
> output from a function, a join tuple, or a subquery. So in the
> attached patch, I have implemented approach1.
>
> For testing, I implemented both approach1 and approach2 and checked
> the performance with pgbench to see whether it impacts the performance
> of the generic paths, but I did not see any impact.
>
> >
> > > I have already extracted these 2 patches from the main patch set.
> > > But, in these patches, I am still storing the am_oid in the toast
> > > header. I am not sure whether we can get rid of that, at least for
> > > these 2 patches. But then, wherever we try to uncompress the tuple we
> > > need to know the tuple descriptor to get the am_oid, and I think that
> > > is not possible in all the cases. Am I missing something here?
> >
> > I think we should instead use the high bits of the toast size word for
> > patches #1-#4, as discussed upthread.
> >
> > > > > Patch #3. Add support for changing the compression method associated
> > > > > with a column, forcing a table rewrite.
> > > > >
> > > > > Patch #4. Add support for PRESERVE, so that you can change the
> > > > > compression method associated with a column without forcing a table
> > > > > rewrite, by including the old method in the PRESERVE list, or with a
> > > > > rewrite, by not including it in the PRESERVE list.
> > >
> > > Does this make sense to have Patch #3 and Patch #4, without having
> > > Patch #5? I mean, why do we need to support rewrite or preserve unless
> > > we have the custom compression methods, right? Because the built-in
> > > compression method cannot be dropped, so why do we need to preserve?
> >
> > I think that patch #3 makes sense because somebody might have a table
> > that is currently compressed with pglz and they want to switch to lz4,
> > and I think patch #4 also makes sense because they might want to start
> > using lz4 for future data but not force a rewrite to get rid of all
> > the pglz data they've already got. Those options are valuable as soon
> > as there is more than one possible compression algorithm, even if
> > they're all built in. Now, as I said upthread, it's also true that you
> > could do #5 before #3 and #4. I don't think that's insane. But I
> > prefer it in the other order, because I think having #5 without #3 and
> > #4 wouldn't be too much fun for users.
>
> Details of the attached patch set
>
> 0001: This provides syntax to set the compression method from the
> built-in compression methods (pglz or zlib). pg_attribute stores the
> compression method (as a char) and there are conversion functions to
> convert that compression method to the built-in compression array
> index. As discussed upthread, the first 2 bits of the toast size word
> will store the compression method index, through which we can directly
> get the handler routine from the built-in compression method array.
>
> 0002: This patch provides an option to change the compression method
> for an existing column; doing so rewrites the table.
>
> Next, I will be working on providing an option to alter the
> compression method without rewriting the whole table; basically, we
> can provide a PRESERVE list to keep the old compression methods.
I have rebased the patch and I have also done a couple of defect fixes
and some cleanup.
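For reference, the recompression check that approach1 (quoted above)
describes boils down to something like the following sketch; the type
and names here are toy stand-ins, not the patch's actual API:

    #include <stdbool.h>

    /* Hypothetical stand-in for a varlena datum's compression state. */
    typedef struct
    {
        bool compressed;
        char method;        /* 'p' = pglz, 'z' = zlib */
        /* payload omitted */
    } ToyDatum;

    /* approach1 sketch: before storing into the target table, decompress
     * any value whose stored method differs from the target column's;
     * the toaster will recompress it with the target method on insert. */
    static void
    adjust_for_target(ToyDatum *value, char target_method)
    {
        if (value->compressed && value->method != target_method)
            value->compressed = false;  /* i.e. decompress here */
    }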
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v3-0002-alter-table-set-compression.patch | application/octet-stream | 9.9 KB |
v3-0001-Built-in-compression-method.patch | application/octet-stream | 186.9 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Re: [HACKERS] Custom compression methods |
Date: | 2020-10-04 10:31:17 |
Message-ID: | CAFiTN-v2_h2+nPv6A-XLrs+E-ms970Xr78wGYv8A9fmDb1P0AQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Sep 28, 2020 at 4:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, Sep 19, 2020 at 1:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Aug 25, 2020 at 11:20 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Aug 24, 2020 at 2:12 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > IIUC, the main reason for using this flag is for deciding
> > > > whether we need any detoasting for this tuple. For example, if we are
> > > > rewriting the table because the compression method has changed, then if
> > > > the HEAP_HASCUSTOMCOMPRESSED bit is not set in the tuple header and the
> > > > tuple length (tup->t_len) is not > TOAST_TUPLE_THRESHOLD, then we don't
> > > > need to call the heap_toast_insert_or_update function for this tuple.
> > > > Whereas if this flag is set then we need to, because we might need to
> > > > uncompress and compress back using a different compression method. The
> > > > same is the case with INSERT INTO ... SELECT * FROM.
> > >
> > > This doesn't really seem worth it to me. I don't see how we can
> > > justify burning an on-disk bit just to save a little bit of overhead
> > > during a rare maintenance operation. If there's a performance problem
> > > here we need to look for another way of mitigating it. Slowing CLUSTER
> > > and/or VACUUM FULL down by a large amount for this feature would be
> > > unacceptable, but is that really a problem? And if so, can we solve it
> > > without requiring this bit?
> >
> > Okay, if we want to avoid keeping the bit then there are multiple ways
> > to handle this, but the only thing is that none of them will be specific
> > to those scenarios.
> > approach1. In ExecModifyTable, we can process the source tuple and see
> > whether any of the varlena attributes is compressed and its stored
> > compression method is not the same as that of the target table
> > attribute; then we can decompress it.
> > approach2. In heap_prepare_insert, always call
> > heap_toast_insert_or_update; therein we can check whether any of the
> > source tuple attributes are compressed with a different compression
> > method than the target table's; then we can decompress it.
> >
> > With either approach, we have to do this in a generic path
> > because the source of the tuple is not known; I mean, it can be an
> > output from a function, a join tuple, or a subquery. So in the
> > attached patch, I have implemented approach1.
> >
> > For testing, I implemented both approach1 and approach2 and checked
> > the performance with pgbench to see whether it impacts the performance
> > of the generic paths, but I did not see any impact.
> >
> > >
> > > > I have already extracted these 2 patches from the main patch set.
> > > > But, in these patches, I am still storing the am_oid in the toast
> > > > header. I am not sure whether we can get rid of that, at least for
> > > > these 2 patches. But then, wherever we try to uncompress the tuple we
> > > > need to know the tuple descriptor to get the am_oid, and I think that
> > > > is not possible in all the cases. Am I missing something here?
> > >
> > > I think we should instead use the high bits of the toast size word for
> > > patches #1-#4, as discussed upthread.
> > >
> > > > > > Patch #3. Add support for changing the compression method associated
> > > > > > with a column, forcing a table rewrite.
> > > > > >
> > > > > > Patch #4. Add support for PRESERVE, so that you can change the
> > > > > > compression method associated with a column without forcing a table
> > > > > > rewrite, by including the old method in the PRESERVE list, or with a
> > > > > > rewrite, by not including it in the PRESERVE list.
> > > >
> > > > Does this make sense to have Patch #3 and Patch #4, without having
> > > > Patch #5? I mean, why do we need to support rewrite or preserve unless
> > > > we have the custom compression methods, right? Because the built-in
> > > > compression method cannot be dropped, so why do we need to preserve?
> > >
> > > I think that patch #3 makes sense because somebody might have a table
> > > that is currently compressed with pglz and they want to switch to lz4,
> > > and I think patch #4 also makes sense because they might want to start
> > > using lz4 for future data but not force a rewrite to get rid of all
> > > the pglz data they've already got. Those options are valuable as soon
> > > as there is more than one possible compression algorithm, even if
> > > they're all built in. Now, as I said upthread, it's also true that you
> > > could do #5 before #3 and #4. I don't think that's insane. But I
> > > prefer it in the other order, because I think having #5 without #3 and
> > > #4 wouldn't be too much fun for users.
> >
> > Details of the attached patch set
> >
> > 0001: This provides syntax to set the compression method from the
> > built-in compression methods (pglz or zlib). pg_attribute stores the
> > compression method (as a char) and there are conversion functions to
> > convert that compression method to the built-in compression array
> > index. As discussed upthread, the first 2 bits of the toast size word
> > will store the compression method index, through which we can directly
> > get the handler routine from the built-in compression method array.
> >
> > 0002: This patch provides an option to change the compression method
> > for an existing column; doing so rewrites the table.
> >
> > Next, I will be working on providing an option to alter the
> > compression method without rewriting the whole table; basically, we
> > can provide a PRESERVE list to keep the old compression methods.
>
> I have rebased the patch and I have also done a couple of defect fixes
> and some cleanup.
Here is the next patch, which allows providing a PRESERVE list; using
this we can avoid a table rewrite while altering the compression method.
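Conceptually, the no-rewrite decision is a subset check: a rewrite can
be skipped only when every compression method previously used on the
column appears in the new PRESERVE list. A minimal sketch (all names
hypothetical):

    #include <stdbool.h>

    /* Return true when every old method is covered by the PRESERVE
     * list, so existing datums stay readable and no rewrite is needed. */
    static bool
    preserve_list_covers(const char *old_methods, int nold,
                         const char *preserved, int npreserved)
    {
        for (int i = 0; i < nold; i++)
        {
            bool found = false;

            for (int j = 0; j < npreserved; j++)
            {
                if (old_methods[i] == preserved[j])
                {
                    found = true;
                    break;
                }
            }
            if (!found)
                return false;   /* an old datum would become unreadable */
        }
        return true;
    }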
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Built-in-compression-method.patch | application/octet-stream | 186.9 KB |
v3-0002-alter-table-set-compression.patch | application/octet-stream | 11.9 KB |
v3-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 33.4 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-04 22:07:13 |
Message-ID: | 20201004220713.6vlmm2e3amlz2dil@development |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
I took a look at this patch after a long time, and did a bit of
review and testing. I haven't re-read the whole thread since 2017, so
some of the following comments might be mistaken - sorry about that :-(
1) The "cmapi.h" naming seems unnecessarily short. I'd suggest using
simply compression or something like that. I see little reason to
shorten "compression" to "cm", or to prefix files with "cm_". For
example compression/cm_zlib.c might just be compression/zlib.c.
2) I see index_form_tuple does this:
Datum cvalue = toast_compress_datum(untoasted_values[i],
                                    DefaultCompressionMethod);
which seems wrong - why shouldn't the indexes use the same compression
method as the underlying table?
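A plausible fix, sketched under the assumption that the patch keeps the
per-column setting in an attcompression field of pg_attribute (as
described elsewhere in this thread), would be to pass the column's own
method instead of the default:

    Form_pg_attribute att = TupleDescAttr(tupleDescriptor, i);

    /* use the column's configured method, not the default one */
    Datum cvalue = toast_compress_datum(untoasted_values[i],
                                        att->attcompression);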
3) dumpTableSchema in pg_dump.c does this:
switch (tbinfo->attcompression[j])
{
    case 'p':
        cmname = "pglz";
    case 'z':
        cmname = "zlib";
}
which is broken as it's missing a break, so 'p' will produce 'zlib'.
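For reference, the corrected switch would presumably look like this
(sketch; the default arm is my assumption about how unknown values
should be handled):

    switch (tbinfo->attcompression[j])
    {
        case 'p':
            cmname = "pglz";
            break;
        case 'z':
            cmname = "zlib";
            break;
        default:
            cmname = NULL;      /* unknown method */
            break;
    }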
4) The name ExecCompareCompressionMethod is somewhat misleading, as the
function is not merely comparing compression methods - it also
recompresses the data.
5) CheckCompressionMethodsPreserved should document what the return
value is (true when the new list contains all old values, thus not
requiring a rewrite). Maybe "Compare" would be a better name?
6) The new field in ColumnDef is missing a comment.
7) It's not clear to me what "partial list" in the PRESERVE docs means.
+ which of them should be kept on the column. Without PRESERVE or partial
+ list of compression methods the table will be rewritten.
8) The initial synopsis in alter_table.sgml includes the PRESERVE
syntax, but then later in the page it's omitted (yet the section talks
about the keyword).
9) attcompression ...
The main issue I see is what the patch does with attcompression. Instead
of just using it to store the compression method, it's also used to
store the preserved compression methods. And using NameData to store
this seems wrong too - if we really want to store this info, the correct
way is either using text[] or inventing charvector or similar.
But to me this seems very much like a misuse of attcompression to track
dependencies on compression methods, necessary because we don't have a
separate catalog listing compression methods. If we had that, I think we
could simply add dependencies between attributes and that catalog.
Moreover, having the catalog would allow adding compression methods
(from extensions etc) instead of just having a list of hard-coded
compression methods. Which seems like a strange limitation, considering
this thread is called "custom compression methods".
10) compression parameters?
I wonder if we could/should allow parameters, like compression level
(and maybe other stuff, depending on the compression method). PG13
allowed that for opclasses, so perhaps we should allow it here too.
11) pg_column_compression
When specifying a compression method not present in attcompression, we get
this error message and hint:
test=# alter table t alter COLUMN a set compression "pglz" preserve (zlib);
ERROR: "zlib" compression access method cannot be preserved
HINT: use "pg_column_compression" function for list of compression methods
but there is no pg_column_compression function, so the hint is wrong.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-05 05:47:28 |
Message-ID: | CAFiTN-t9nb2rRdL3uauRPgv5rAP-yBuAKQKfY=jiRApSmDC4MQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
Thanks, Tomas for your feedback.
> 9) attcompression ...
>
> The main issue I see is what the patch does with attcompression. Instead
> of just using it to store the compression method, it's also used to
> store the preserved compression methods. And using NameData to store
> this seems wrong too - if we really want to store this info, the correct
> way is either using text[] or inventing charvector or similar.
The reason for using NameData is to keep it in the fixed part of
the data structure.
> But to me this seems very much like a misuse of attcompression to track
> dependencies on compression methods, necessary because we don't have a
> separate catalog listing compression methods. If we had that, I think we
> could simply add dependencies between attributes and that catalog.
Basically, up to this patch we have only built-in compression
methods, and those cannot be dropped, so we don't need any dependency
at all. We just want to know what the current compression method is
and what the preserved compression methods for this attribute are.
Maybe we can do it better instead of using the NameData,
but I don't think it makes sense to add a separate catalog?
> Moreover, having the catalog would allow adding compression methods
> (from extensions etc) instead of just having a list of hard-coded
> compression methods. Which seems like a strange limitation, considering
> this thread is called "custom compression methods".
I think I forgot to mention while submitting the previous patch that
the next patch I am planning to submit is: support creating custom
compression methods, wherein we can use the pg_am catalog to insert the
new compression method. And for dependency handling, we can create an
attribute dependency on the pg_am row. Basically, we will create the
attribute dependency on the current compression method AM as well as
on the preserved compression methods' AMs. As part of this, we will
add two built-in AMs for zlib and pglz, and the attcompression field
will be converted to an oid_vector (the first OID will be that of the
current compression method, followed by the preserved compression
methods' OIDs).
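To illustrate that proposed layout, here is a sketch of how such an
oid_vector might be consumed; the helper names are hypothetical, though
oidvector itself (with its dim1 and values fields) is the existing
PostgreSQL type, so this assumes the usual PostgreSQL headers:

    /* Slot 0 holds the active method; the rest are preserved methods. */
    static Oid
    attr_current_compression(const oidvector *attcompression)
    {
        return attcompression->values[0];
    }

    static bool
    attr_method_is_preserved(const oidvector *attcompression, Oid method)
    {
        for (int i = 1; i < attcompression->dim1; i++)
        {
            if (attcompression->values[i] == method)
                return true;
        }
        return false;
    }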
> 10) compression parameters?
>
> I wonder if we could/should allow parameters, like compression level
> (and maybe other stuff, depending on the compression method). PG13
> allowed that for opclasses, so perhaps we should allow it here too.
Yes, that is also in the plan. For doing this we are planning to add
an extra column in pg_attribute which will store the compression
options for the current compression method. The original patch
created an extra catalog, pg_column_compression, which maintained
the oid of the compression method as well as the compression options.
The advantage of creating an extra catalog is that we can also keep the
compression options for the preserved compression methods, so that
we can support options which are needed for decompressing the
data as well. Whereas if we want to avoid this extra catalog, then we
cannot use those compression options for decompressing. But most of
the options, e.g. compression level, are just for compressing, so it
is enough to store them for the current compression method only. What
are your thoughts?
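A sketch of what per-method options kept alongside the attribute could
look like (purely illustrative; none of these names exist in the patch):

    /* One entry per compression method ever used on the column, so
     * that options needed at decompression time (e.g. a dictionary ID)
     * stay resolvable for preserved data, not just the current method. */
    typedef struct ColumnCompressionOption
    {
        char        method;        /* 'p' = pglz, 'z' = zlib, ... */
        const char *options;       /* e.g. "level=6" */
    } ColumnCompressionOption;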
Other comments look fine to me so I will work on them and post the
updated patch set.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-05 07:48:14 |
Message-ID: | CAFiTN-vp=or=1x=DjttV=WfLPsSWnKUVecL2PFGJHOqrrXTsrg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 5, 2020 at 11:17 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Thanks, Tomas for your feedback.
>
> > 9) attcompression ...
> >
> > The main issue I see is what the patch does with attcompression. Instead
> > of just using it to store the compression method, it's also used to
> > store the preserved compression methods. And using NameData to store
> > this seems wrong too - if we really want to store this info, the correct
> > way is either using text[] or inventing charvector or similar.
>
> The reason for using NameData is to keep it in the fixed part of
> the data structure.
>
> > But to me this seems very much like a misuse of attcompression to track
> > dependencies on compression methods, necessary because we don't have a
> > separate catalog listing compression methods. If we had that, I think we
> > could simply add dependencies between attributes and that catalog.
>
> Basically, up to this patch we have only built-in compression
> methods, and those cannot be dropped, so we don't need any dependency
> at all. We just want to know what the current compression method is
> and what the preserved compression methods for this attribute are.
> Maybe we can do it better instead of using the NameData,
> but I don't think it makes sense to add a separate catalog?
>
> > Moreover, having the catalog would allow adding compression methods
> > (from extensions etc) instead of just having a list of hard-coded
> > compression methods. Which seems like a strange limitation, considering
> > this thread is called "custom compression methods".
>
> I think I forgot to mention while submitting the previous patch that
> the next patch I am planning to submit is: support creating custom
> compression methods, wherein we can use the pg_am catalog to insert the
> new compression method. And for dependency handling, we can create an
> attribute dependency on the pg_am row. Basically, we will create the
> attribute dependency on the current compression method AM as well as
> on the preserved compression methods' AMs. As part of this, we will
> add two built-in AMs for zlib and pglz, and the attcompression field
> will be converted to an oid_vector (the first OID will be that of the
> current compression method, followed by the preserved compression
> methods' OIDs).
>
> > 10) compression parameters?
> >
> > I wonder if we could/should allow parameters, like compression level
> > (and maybe other stuff, depending on the compression method). PG13
> > allowed that for opclasses, so perhaps we should allow it here too.
>
> Yes, that is also in the plan. For doing this we are planning to add
> an extra column in pg_attribute which will store the compression
> options for the current compression method. The original patch
> created an extra catalog, pg_column_compression, which maintained
> the oid of the compression method as well as the compression options.
> The advantage of creating an extra catalog is that we can also keep the
> compression options for the preserved compression methods, so that
> we can support options which are needed for decompressing the
> data as well. Whereas if we want to avoid this extra catalog, then we
> cannot use those compression options for decompressing. But most of
> the options, e.g. compression level, are just for compressing, so it
> is enough to store them for the current compression method only. What
> are your thoughts?
>
> Other comments look fine to me so I will work on them and post the
> updated patch set.
I have fixed the other comments, except this one:
> 2) I see index_form_tuple does this:
>
> Datum cvalue = toast_compress_datum(untoasted_values[i],
>                                     DefaultCompressionMethod);
> which seems wrong - why shouldn't the indexes use the same compression
> method as the underlying table?
I will fix this in the next version of the patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Built-in-compression-method.patch | application/octet-stream | 187.2 KB |
v4-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 33.7 KB |
v4-0002-alter-table-set-compression.patch | application/octet-stream | 11.9 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-05 12:23:07 |
Message-ID: | 20201005122307.6sebqnn3qnvkvjoc@development |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>Thanks, Tomas for your feedback.
>
>> 9) attcompression ...
>>
>> The main issue I see is what the patch does with attcompression. Instead
>> of just using it to store the compression method, it's also used to
>> store the preserved compression methods. And using NameData to store
>> this seems wrong too - if we really want to store this info, the correct
>> way is either using text[] or inventing charvector or similar.
>
>The reason for using NameData is to keep it in the fixed part
>of the data structure.
>
Why do we need that? It's possible to have varlena fields with direct
access (see pg_index.indkey for example). Adding NameData just to make
it fixed-length means we're always adding 64B even if we just need a
single byte, which means ~30% overhead for the FormData_pg_attribute.
That seems a bit unnecessary, and might be an issue with many attributes
(e.g. with many temp tables, etc.).
>> But to me this seems very much like a misuse of attcompression to track
>> dependencies on compression methods, necessary because we don't have a
>> separate catalog listing compression methods. If we had that, I think we
>> could simply add dependencies between attributes and that catalog.
>
>Basically, up to this patch we have only built-in compression
>methods, and those cannot be dropped, so we don't need any dependency
>at all. We just want to know what the current compression method is
>and what the preserved compression methods for this attribute are.
>Maybe we can do it better instead of using the NameData,
>but I don't think it makes sense to add a separate catalog?
>
Sure, I understand what the goal was - all I'm saying is that it looks
very much like a workaround needed because we don't have the catalog.
I don't quite understand how we could support custom compression methods
without listing them in some sort of catalog?
>> Moreover, having the catalog would allow adding compression methods
>> (from extensions etc) instead of just having a list of hard-coded
>> compression methods. Which seems like a strange limitation, considering
>> this thread is called "custom compression methods".
>
>I think I forgot to mention while submitting the previous patch that
>the next patch I am planning to submit is: support creating custom
>compression methods, wherein we can use the pg_am catalog to insert the
>new compression method. And for dependency handling, we can create an
>attribute dependency on the pg_am row. Basically, we will create the
>attribute dependency on the current compression method AM as well as
>on the preserved compression methods' AMs. As part of this, we will
>add two built-in AMs for zlib and pglz, and the attcompression field
>will be converted to an oid_vector (the first OID will be that of the
>current compression method, followed by the preserved compression
>methods' OIDs).
>
Hmmm, ok. Not sure pg_am is the right place - compression methods don't
quite match what I thought AMs are about, but maybe it's just my fault.
FWIW it seems a bit strange to first do the attcompression magic and
then add the catalog later - I think we should start with the catalog
right away. The advantage is that if we end up committing only some of
the patches in this cycle, we already have all the infrastructure etc.
We can reorder that later, though.
>> 10) compression parameters?
>>
>> I wonder if we could/should allow parameters, like compression level
>> (and maybe other stuff, depending on the compression method). PG13
>> allowed that for opclasses, so perhaps we should allow it here too.
>
>Yes, that is also in the plan. For doing this we are planning to add
>an extra column in pg_attribute which will store the compression
>options for the current compression method. The original patch
>created an extra catalog, pg_column_compression, which maintained
>the oid of the compression method as well as the compression options.
>The advantage of creating an extra catalog is that we can also keep the
>compression options for the preserved compression methods, so that
>we can support options which are needed for decompressing the
>data as well. Whereas if we want to avoid this extra catalog, then we
>cannot use those compression options for decompressing. But most of
>the options, e.g. compression level, are just for compressing, so it
>is enough to store them for the current compression method only. What
>are your thoughts?
>
Not sure. My assumption was we'd end up with a new catalog, but maybe
stashing it into pg_attribute is fine. I was really thinking about two
kinds of options - compression level, and some sort of column-level
dictionary. Compression level is not necessary for decompression, but
the dictionary ID would be needed. (I think the global dictionary was
one of the use cases, aimed at JSON compression.)
But I don't think stashing it in pg_attribute means we couldn't use it
for decompression - we'd just need to keep an array of options, one for
each compression method. Keeping it in a separate new catalog might be
cleaner, and I'm not sure how large the configuration might be.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-05 14:27:41 |
Message-ID: | CAFiTN-tLLYKqwaB8E95b-1mbvvR7b9Z_NV0rxjM3m15LMJm+aQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> >Thanks, Tomas for your feedback.
> >
> >> 9) attcompression ...
> >>
> >> The main issue I see is what the patch does with attcompression. Instead
> >> of just using it to store the compression method, it's also used to
> >> store the preserved compression methods. And using NameData to store
> >> this seems wrong too - if we really want to store this info, the correct
> >> way is either using text[] or inventing charvector or similar.
> >
> >The reason for using NameData is to keep it in the fixed part
> >of the data structure.
> >
>
> Why do we need that? It's possible to have varlena fields with direct
> access (see pg_index.indkey for example).
I see. While making it NameData I was wondering whether we have an
option to directly access the varlena. Thanks for pointing me there. I
will change this.
> Adding NameData just to make
> it fixed-length means we're always adding 64B even if we just need a
> single byte, which means ~30% overhead for the FormData_pg_attribute.
> That seems a bit unnecessary, and might be an issue with many attributes
> (e.g. with many temp tables, etc.).
You are right. Even I did not like to keep 64B for this, so I will change it.
>
> >> But to me this seems very much like a misuse of attcompression to track
> >> dependencies on compression methods, necessary because we don't have a
> >> separate catalog listing compression methods. If we had that, I think we
> >> could simply add dependencies between attributes and that catalog.
> >
> >Basically, up to this patch we have only built-in compression
> >methods, and those cannot be dropped, so we don't need any dependency
> >at all. We just want to know what the current compression method is
> >and what the preserved compression methods for this attribute are.
> >Maybe we can do it better instead of using the NameData,
> >but I don't think it makes sense to add a separate catalog?
> >
>
> Sure, I understand what the goal was - all I'm saying is that it looks
> very much like a workaround needed because we don't have the catalog.
>
> I don't quite understand how we could support custom compression methods
> without listing them in some sort of catalog?
Yeah for supporting custom compression we need some catalog.
> >> Moreover, having the catalog would allow adding compression methods
> >> (from extensions etc) instead of just having a list of hard-coded
> >> compression methods. Which seems like a strange limitation, considering
> >> this thread is called "custom compression methods".
> >
> >I think I forgot to mention while submitting the previous patch that
> >the next patch I am planning to submit is: support creating custom
> >compression methods, wherein we can use the pg_am catalog to insert the
> >new compression method. And for dependency handling, we can create an
> >attribute dependency on the pg_am row. Basically, we will create the
> >attribute dependency on the current compression method AM as well as
> >on the preserved compression methods' AMs. As part of this, we will
> >add two built-in AMs for zlib and pglz, and the attcompression field
> >will be converted to an oid_vector (the first OID will be that of the
> >current compression method, followed by the preserved compression
> >methods' OIDs).
> >
>
> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> quite match what I thought AMs are about, but maybe it's just my fault.
>
> FWIW it seems a bit strange to first do the attcompression magic and
> then add the catalog later - I think we should start with the catalog
> right away. The advantage is that if we end up committing only some of
> the patches in this cycle, we already have all the infrastructure etc.
> We can reorder that later, though.
Hmm, yeah, we can do it this way as well: first create a new catalog
table and add entries for these two built-in methods, and
attcompression can store the oid vector. But if we only commit the
built-in compression methods part, then does it make sense to create an
extra catalog, or to add these built-in methods to an existing catalog
(if we plan to use pg_am)? Then in attcompression, instead of using
one byte for each preserved compression method, we need to use an oid.
From Robert's mail[1], it appeared to me that he wants the
built-in compression methods part to be independently committable,
and if we think from that perspective then adding a catalog doesn't
make much sense. But if we are planning to commit the custom method
part also, then it makes more sense to directly start with the catalog,
because that way it will be easy to expand without much refactoring.
> >> 10) compression parameters?
> >>
> >> I wonder if we could/should allow parameters, like compression level
> >> (and maybe other stuff, depending on the compression method). PG13
> >> allowed that for opclasses, so perhaps we should allow it here too.
> >
> >Yes, that is also in the plan. For doing this we are planning to add
> >an extra column in pg_attribute which will store the compression
> >options for the current compression method. The original patch
> >created an extra catalog, pg_column_compression, which maintained
> >the oid of the compression method as well as the compression options.
> >The advantage of creating an extra catalog is that we can also keep the
> >compression options for the preserved compression methods, so that
> >we can support options which are needed for decompressing the
> >data as well. Whereas if we want to avoid this extra catalog, then we
> >cannot use those compression options for decompressing. But most of
> >the options, e.g. compression level, are just for compressing, so it
> >is enough to store them for the current compression method only. What
> >are your thoughts?
> >
>
> Not sure. My assumption was we'd end up with a new catalog, but maybe
> stashing it into pg_attribute is fine. I was really thinking about two
> kinds of options - compression level, and some sort of column-level
> dictionary. Compression level is not necessary for decompression, but
> the dictionary ID would be needed. (I think the global dictionary was
> one of the use cases, aimed at JSON compression.)
Ok
> But I don't think stashing it in pg_attribute means we couldn't use it
> for decompression - we'd just need to keep an array of options, one for
> each compression method.
Yeah, we can do that.
> Keeping it in a separate new catalog might be
> cleaner, and I'm not sure how large the configuration might be.
Yeah, in that case it will be better to store it in a separate catalog,
because if multiple attributes are using the same
compression method with the same options then we can store the same
oid in attcompression instead of duplicating the option field.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-05 16:03:55 |
Message-ID: | 20201005160355.byp74sh3ejsv7wrj@development |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
>On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
>> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >
>> >Thanks, Tomas for your feedback.
>> >
>> >> 9) attcompression ...
>> >>
>> >> The main issue I see is what the patch does with attcompression. Instead
>> >> of just using it to store the compression method, it's also used to
>> >> store the preserved compression methods. And using NameData to store
>> >> this seems wrong too - if we really want to store this info, the correct
>> >> way is either using text[] or inventing charvector or similar.
>> >
>> >The reason for using NameData is to keep it in the fixed part
>> >of the data structure.
>> >
>>
>> Why do we need that? It's possible to have varlena fields with direct
>> access (see pg_index.indkey for example).
>
>I see. While making it NameData I was wondering whether we have an
>option to directly access the varlena. Thanks for pointing me there. I
>will change this.
>
>> Adding NameData just to make
>> it fixed-length means we're always adding 64B even if we just need a
>> single byte, which means ~30% overhead for the FormData_pg_attribute.
>> That seems a bit unnecessary, and might be an issue with many attributes
>> (e.g. with many temp tables, etc.).
>
>You are right. Even I did not like to keep 64B for this, so I will change it.
>
>>
>> >> But to me this seems very much like a misuse of attcompression to track
>> >> dependencies on compression methods, necessary because we don't have a
>> >> separate catalog listing compression methods. If we had that, I think we
>> >> could simply add dependencies between attributes and that catalog.
>> >
>> >Basically, up to this patch we have only built-in compression
>> >methods, and those cannot be dropped, so we don't need any dependency
>> >at all. We just want to know what the current compression method is
>> >and what the preserved compression methods for this attribute are.
>> >Maybe we can do it better instead of using the NameData,
>> >but I don't think it makes sense to add a separate catalog?
>> >
>>
>> Sure, I understand what the goal was - all I'm saying is that it looks
>> very much like a workaround needed because we don't have the catalog.
>>
>> I don't quite understand how we could support custom compression methods
>> without listing them in some sort of catalog?
>
>Yeah for supporting custom compression we need some catalog.
>
>> >> Moreover, having the catalog would allow adding compression methods
>> >> (from extensions etc) instead of just having a list of hard-coded
>> >> compression methods. Which seems like a strange limitation, considering
>> >> this thread is called "custom compression methods".
>> >
>> >I think I forgot to mention while submitting the previous patch that
>> >the next patch I am planning to submit is: support creating custom
>> >compression methods, wherein we can use the pg_am catalog to insert the
>> >new compression method. And for dependency handling, we can create an
>> >attribute dependency on the pg_am row. Basically, we will create the
>> >attribute dependency on the current compression method AM as well as
>> >on the preserved compression methods' AMs. As part of this, we will
>> >add two built-in AMs for zlib and pglz, and the attcompression field
>> >will be converted to an oid_vector (the first OID will be that of the
>> >current compression method, followed by the preserved compression
>> >methods' OIDs).
>> >
>>
>> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
>> quite match what I thought AMs are about, but maybe it's just my fault.
>>
>> FWIW it seems a bit strange to first do the attcompression magic and
>> then add the catalog later - I think we should start with the catalog
>> right away. The advantage is that if we end up committing only some of
>> the patches in this cycle, we already have all the infrastructure etc.
>> We can reorder that later, though.
>
>Hmm, yeah, we can do it this way as well: first create a new catalog
>table and add entries for these two built-in methods, and
>attcompression can store the oid vector. But if we only commit the
>built-in compression methods part, then does it make sense to create an
>extra catalog, or to add these built-in methods to an existing catalog
>(if we plan to use pg_am)? Then in attcompression, instead of using
>one byte for each preserved compression method, we need to use an oid.
>From Robert's mail[1], it appeared to me that he wants the
>built-in compression methods part to be independently committable,
>and if we think from that perspective then adding a catalog doesn't
>make much sense. But if we are planning to commit the custom method
>part also, then it makes more sense to directly start with the catalog,
>because that way it will be easy to expand without much refactoring.
>
>[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
>
Hmmm. Maybe I'm missing something subtle, but I think that plan can be
interpreted in various ways - it does not really say whether the initial
list of built-in methods should be in some C array, or already in a proper
catalog.
All I'm saying is it seems a bit weird to first implement dependencies
based on strange (mis)use of attcompression attribute, and then replace
it with a proper catalog. My understanding is those patches are expected
to be committable one by one, but the attcompression approach seems a
bit too hacky to me - not sure I'd want to commit that ...
>> >> 10) compression parameters?
>> >>
>> >> I wonder if we could/should allow parameters, like compression level
>> >> (and maybe other stuff, depending on the compression method). PG13
>> >> allowed that for opclasses, so perhaps we should allow it here too.
>> >
>> >Yes, that is also in the plan. For doing this we are planning to add
>> >an extra column in pg_attribute which will store the compression
>> >options for the current compression method. The original patch
>> >created an extra catalog, pg_column_compression, which maintained
>> >the oid of the compression method as well as the compression options.
>> >The advantage of creating an extra catalog is that we can also keep the
>> >compression options for the preserved compression methods, so that
>> >we can support options which are needed for decompressing the
>> >data as well. Whereas if we want to avoid this extra catalog, then we
>> >cannot use those compression options for decompressing. But most of
>> >the options, e.g. compression level, are just for compressing, so it
>> >is enough to store them for the current compression method only. What
>> >are your thoughts?
>> >
>>
>> Not sure. My assumption was we'd end up with a new catalog, but maybe
>> stashing it into pg_attribute is fine. I was really thinking about two
>> kinds of options - compression level, and some sort of column-level
>> dictionary. Compression level is not necessary for decompression, but
>> the dictionary ID would be needed. (I think the global dictionary was
>> one of the use cases, aimed at JSON compression.)
>
>Ok
>
>> But I don't think stashing it in pg_attribute means we couldn't use it
>> for decompression - we'd just need to keep an array of options, one for
>> each compression method.
>
>Yeah, we can do that.
>
>> Keeping it in a separate new catalog might be
>> cleaner, and I'm not sure how large the configuration might be.
>
>Yeah, in that case it will be better to store it in a separate catalog,
>because if multiple attributes are using the same
>compression method with the same options then we can store the same
>oid in attcompression instead of duplicating the option field.
>
I doubt deduplicating the options like this (sharing options between
columns) is really worth it, as it means extra complexity e.g. during
ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
catalogs, so why should we do it here?
Ultimately I think it's a question of how large we expect the options to
be, and how flexible it needs to be.
For example, what happens if the user does this:
ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
ALTER ... SET COMPRESSION pglz PRESERVE;
ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
I believe it's enough to keep just the last value, but maybe I'm wrong
and we need to keep the whole history?
The use case I'm thinking about is the column-level JSON compression,
where one of the options identifies the dictionary. OTOH I'm not sure
this is the right way to track this info - we need to know which values
were compressed with which options, i.e. it needs to be encoded in each
value directly. It'd also require changes to the PRESERVE handling
because it'd be necessary to identify which options to preserve ...
So maybe this is either nonsense or something we don't want to support,
and we should only allow one option for each compression method.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-06 05:30:55 |
Message-ID: | CAFiTN-tttcnQ=KNb40Nvrwi6GsJOk_Ni2-S-Ud3Z7eU5144LDQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >> >
> >> >Thanks, Tomas for your feedback.
> >> >
> >> >> 9) attcompression ...
> >> >>
> >> >> The main issue I see is what the patch does with attcompression. Instead
> >> >> of just using it to store the compression method, it's also used to
> >> >> store the preserved compression methods. And using NameData to store
> >> >> this seems wrong too - if we really want to store this info, the correct
> >> >> way is either using text[] or inventing charvector or similar.
> >> >
> >> >The reason for using NameData is to keep it in the fixed part
> >> >of the data structure.
> >> >
> >>
> >> Why do we need that? It's possible to have varlena fields with direct
> >> access (see pg_index.indkey for example).
> >
> >I see. While making it NameData I was wondering whether we have an
> >option to directly access the varlena. Thanks for pointing me there. I
> >will change this.
> >
> >> Adding NameData just to make
> >> it fixed-length means we're always adding 64B even if we just need a
> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> >> That seems a bit unnecessary, and might be an issue with many attributes
> >> (e.g. with many temp tables, etc.).
> >
> >You are right. Even I did not like to keep 64B for this, so I will change it.
> >
> >>
> >> >> But to me this seems very much like a misuse of attcompression to track
> >> >> dependencies on compression methods, necessary because we don't have a
> >> >> separate catalog listing compression methods. If we had that, I think we
> >> >> could simply add dependencies between attributes and that catalog.
> >> >
> >> >Basically, up to this patch we have only built-in compression
> >> >methods, and those cannot be dropped, so we don't need any dependency
> >> >at all. We just want to know what the current compression method is
> >> >and what the preserved compression methods for this attribute are.
> >> >Maybe we can do it better instead of using the NameData,
> >> >but I don't think it makes sense to add a separate catalog?
> >> >
> >>
> >> Sure, I understand what the goal was - all I'm saying is that it looks
> >> very much like a workaround needed because we don't have the catalog.
> >>
> >> I don't quite understand how we could support custom compression methods
> >> without listing them in some sort of catalog?
> >
> >Yeah for supporting custom compression we need some catalog.
> >
> >> >> Moreover, having the catalog would allow adding compression methods
> >> >> (from extensions etc) instead of just having a list of hard-coded
> >> >> compression methods. Which seems like a strange limitation, considering
> >> >> this thread is called "custom compression methods".
> >> >
> >> >I think I forgot to mention while submitting the previous patch that
> >> >the next patch I am planning to submit is: support creating custom
> >> >compression methods, wherein we can use the pg_am catalog to insert the
> >> >new compression method. And for dependency handling, we can create an
> >> >attribute dependency on the pg_am row. Basically, we will create the
> >> >attribute dependency on the current compression method AM as well as
> >> >on the preserved compression methods' AMs. As part of this, we will
> >> >add two built-in AMs for zlib and pglz, and the attcompression field
> >> >will be converted to an oid_vector (the first OID will be that of the
> >> >current compression method, followed by the preserved compression
> >> >methods' OIDs).
> >> >
> >>
> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> >> quite match what I thought AMs are about, but maybe it's just my fault.
> >>
> >> FWIW it seems a bit strange to first do the attcompression magic and
> >> then add the catalog later - I think we should start with the catalog
> >> right away. The advantage is that if we end up committing only some of
> >> the patches in this cycle, we already have all the infrastructure etc.
> >> We can reorder that later, though.
> >
> >Hmm, yeah, we can do it this way as well: first create a new catalog
> >table and add entries for these two built-in methods, and
> >attcompression can store the oid vector. But if we only commit the
> >built-in compression methods part, then does it make sense to create an
> >extra catalog, or to add these built-in methods to an existing catalog
> >(if we plan to use pg_am)? Then in attcompression, instead of using
> >one byte for each preserved compression method, we need to use an oid.
> >From Robert's mail[1], it appeared to me that he wants the
> >built-in compression methods part to be independently committable,
> >and if we think from that perspective then adding a catalog doesn't
> >make much sense. But if we are planning to commit the custom method
> >part also, then it makes more sense to directly start with the catalog,
> >because that way it will be easy to expand without much refactoring.
> >
> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> >
>
> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> interpreted in various ways - it does not really say whether the initial
> list of built-in methods should be in some C array, or already in a proper
> catalog.
>
> All I'm saying is it seems a bit weird to first implement dependencies
> based on strange (mis)use of attcompression attribute, and then replace
> it with a proper catalog. My understanding is those patches are expected
> to be committable one by one, but the attcompression approach seems a
> bit too hacky to me - not sure I'd want to commit that ...
Okay, I will change this. I will create a new catalog, pg_compression,
and add entries for the two built-in compression methods from the very
first patch.
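Just to illustrate the direction (the field names here are hypothetical,
not from the patch), such a catalog row could be as simple as:

typedef struct FormData_pg_compression
{
    Oid         oid;        /* referenced from pg_attribute.attcompression */
    NameData    cmname;     /* e.g. "pglz", "zlib" */
    regproc     cmhandler;  /* handler returning the compression routines */
} FormData_pg_compression;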
> >> >> 10) compression parameters?
> >> >>
> >> >> I wonder if we could/should allow parameters, like compression level
> >> >> (and maybe other stuff, depending on the compression method). PG13
> >> >> allowed that for opclasses, so perhaps we should allow it here too.
> >> >
> >> >Yes, that is also in the plan. For doing this we are planning to add
> >> >an extra column in the pg_attribute which will store the compression
> >> >options for the current compression method. The original patch was
> >> >creating an extra catalog pg_column_compression, therein it maintains
> >> >the oid of the compression method as well as the compression options.
> >> >The advantage of creating an extra catalog is that we can keep the
> >> >compression options for the preserved compression methods also so that
> >> >we can support the options which can be used for decompressing the
> >> >data as well. Whereas if we want to avoid this extra catalog then we
> >> >can not use that compression option for decompressing. But most of
> >> >the options e.g. compression level are just for the compressing so it
> >> >is enough to store for the current compression method only. What's
> >> >your thoughts?
> >> >
> >>
> >> Not sure. My assumption was we'd end up with a new catalog, but maybe
> >> stashing it into pg_attribute is fine. I was really thinking about two
> >> kinds of options - compression level, and some sort of column-level
> >> dictionary. Compression level is not necessary for decompression, but
> >> the dictionary ID would be needed. (I think the global dictionary was
> >> one of the use cases, aimed at JSON compression.)
> >
> >Ok
> >
> >> But I don't think stashing it in pg_attribute means we couldn't use it
> >> for decompression - we'd just need to keep an array of options, one for
> >> each compression method.
> >
> >Yeah, we can do that.
> >
> >Keeping it in a separate new catalog might be
> >> cleaner, and I'm not sure how large the configuration might be.
> >
> >Yeah in that case it will be better to store in a separate catalog,
> >because sometimes if multiple attributes are using the same
> >compression method with the same options then we can store the same
> >oid in attcompression instead of duplicating the option field.
> >
>
> I doubt deduplicating the options like this is (sharing options between
> columns) is really worth it, as it means extra complexity e.g. during
> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
> catalogs, so why should we do it here?
Yeah, valid point.
>
> Ultimately I think it's a question of how large we expect the options to
> be, and how flexible it needs to be.
>
> For example, what happens if the user does this:
>
> ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
> ALTER ... SET COMPRESSION pglz PRESERVE;
> ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
>
> I believe it's enough to keep just the last value, but maybe I'm wrong
> and we need to keep the whole history?
Currently, the syntax is ALTER ... SET COMPRESSION my_compression
WITH (options1) PRESERVE (old_compression1, old_compression2, ...). But I
think if the user just gives PRESERVE without a list, then we should
preserve only the latest one.
> The use case I'm thinking about is the column-level JSON compression,
> where one of the options identifies the dictionary. OTOH I'm not sure
> this is the right way to track this info - we need to know which options
> were compressed with which options, i.e. it needs to be encoded in each
> value directly. It'd also require changes to the PRESERVE handling
> because it'd be necessary to identify which options to preserve ...
>
> So maybe this is either nonsense or something we don't want to support,
> and we should only allow one option for each compression method.
Yeah, it is a bit confusing to add the same compression method with
different compression options; the preserve list would then have to
accept the options along with the compression method, so that we know
which compression method with which options we want to preserve. And, as
you mentioned, each row would also need to record the options. I think
that, since for custom compression methods we will have to store the OID
of the compression method in the toast header anyway, we can solve this
with an intermediate catalog that creates a new row for each combination
of compression method + options. The toast header can then store the OID
of that row, so we know exactly which compression method and options each
value was compressed with.
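A minimal sketch of that intermediate catalog in C (all names here are
hypothetical, not from any posted patch) could be:

typedef struct FormData_pg_compression_opt
{
    Oid     oid;        /* this OID goes into the compressed datum's header */
    Oid     cmoid;      /* the compression method this row belongs to */
    /* variable-length part: the options used when compressing,
     * e.g. "level=9", stored as text */
} FormData_pg_compression_opt;

On decompression, the OID recovered from the toast header would be looked
up in this catalog, giving both the method and the exact options that
were in effect when the value was compressed.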
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-06 16:51:32 |
Message-ID: | 20201006165132.mvuide57kxexesse@development |
Lists: | pgsql-hackers |
On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
>On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
>> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
>> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >>
>> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
>> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >> >
>> >> >Thanks, Tomas for your feedback.
>> >> >
>> >> >> 9) attcompression ...
>> >> >>
>> >> >> The main issue I see is what the patch does with attcompression. Instead
>> >> >> of just using it to store a the compression method, it's also used to
>> >> >> store the preserved compression methods. And using NameData to store
>> >> >> this seems wrong too - if we really want to store this info, the correct
>> >> >> way is either using text[] or inventing charvector or similar.
>> >> >
>> >> >The reason for using the NameData is the get it in the fixed part of
>> >> >the data structure.
>> >> >
>> >>
>> >> Why do we need that? It's possible to have varlena fields with direct
>> >> access (see pg_index.indkey for example).
>> >
>> >I see. While making it NameData I was thinking whether we have an
>> >option to direct access the varlena. Thanks for pointing me there. I
>> >will change this.
>> >
>> > Adding NameData just to make
>> >> it fixed-length means we're always adding 64B even if we just need a
>> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
>> >> That seems a bit unnecessary, and might be an issue with many attributes
>> >> (e.g. with many temp tables, etc.).
>> >
>> >You are right. Even I did not like to keep 64B for this, so I will change it.
>> >
>> >>
>> >> >> But to me this seems very much like a misuse of attcompression to track
>> >> >> dependencies on compression methods, necessary because we don't have a
>> >> >> separate catalog listing compression methods. If we had that, I think we
>> >> >> could simply add dependencies between attributes and that catalog.
>> >> >
>> >> >Basically, up to this patch, we are having only built-in compression
>> >> >methods and those can not be dropped so we don't need any dependency
>> >> >at all. We just want to know what is the current compression method
>> >> >and what is the preserve compression methods supported for this
>> >> >attribute. Maybe we can do it better instead of using the NameData
>> >> >but I don't think it makes sense to add a separate catalog?
>> >> >
>> >>
>> >> Sure, I understand what the goal was - all I'm saying is that it looks
>> >> very much like a workaround needed because we don't have the catalog.
>> >>
>> >> I don't quite understand how could we support custom compression methods
>> >> without listing them in some sort of catalog?
>> >
>> >Yeah for supporting custom compression we need some catalog.
>> >
>> >> >> Moreover, having the catalog would allow adding compression methods
>> >> >> (from extensions etc) instead of just having a list of hard-coded
>> >> >> compression methods. Which seems like a strange limitation, considering
>> >> >> this thread is called "custom compression methods".
>> >> >
>> >> >I think I forgot to mention while submitting the previous patch that
>> >> >the next patch I am planning to submit is, Support creating the custom
>> >> >compression methods wherein we can use pg_am catalog to insert the new
>> >> >compression method. And for dependency handling, we can create an
>> >> >attribute dependency on the pg_am row. Basically, we will create the
>> >> >attribute dependency on the current compression method AM as well as
>> >> >on the preserved compression methods AM. As part of this, we will
>> >> >add two build-in AMs for zlib and pglz, and the attcompression field
>> >> >will be converted to the oid_vector (first OID will be of the current
>> >> >compression method, followed by the preserved compression method's
>> >> >oids).
>> >> >
>> >>
>> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
>> >> quite match what I though AMs are about, but maybe it's just my fault.
>> >>
>> >> FWIW it seems a bit strange to first do the attcompression magic and
>> >> then add the catalog later - I think we should start with the catalog
>> >> right away. The advantage is that if we end up committing only some of
>> >> the patches in this cycle, we already have all the infrastructure etc.
>> >> We can reorder that later, though.
>> >
>> >Hmm, yeah we can do this way as well that first create a new catalog
>> >table and add entries for these two built-in methods and the
>> >attcompression can store the oid vector. But if we only commit the
>> >build-in compression methods part then does it make sense to create an
>> >extra catalog or adding these build-in methods to the existing catalog
>> >(if we plan to use pg_am). Then in attcompression instead of using
>> >one byte for each preserve compression method, we need to use oid. So
>> >from Robert's mail[1], it appeared to me that he wants that the
>> >build-in compression methods part should be independently committable
>> >and if we think from that perspective then adding a catalog doesn't
>> >make much sense. But if we are planning to commit the custom method
>> >also then it makes more sense to directly start with the catalog
>> >because that way it will be easy to expand without much refactoring.
>> >
>> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
>> >
>>
>> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
>> interpreted in various ways - it does not really say whether the initial
>> list of built-in methods should be in some C array, or already in a proper
>> catalog.
>>
>> All I'm saying is it seems a bit weird to first implement dependencies
>> based on strange (mis)use of attcompression attribute, and then replace
>> it with a proper catalog. My understanding is those patches are expected
>> to be committable one by one, but the attcompression approach seems a
>> bit too hacky to me - not sure I'd want to commit that ...
>
>Okay, I will change this. So I will make create a new catalog
>pg_compression and add the entry for two built-in compression methods
>from the very first patch.
>
OK.
>> >> >> 10) compression parameters?
>> >> >>
>> >> >> I wonder if we could/should allow parameters, like compression level
>> >> >> (and maybe other stuff, depending on the compression method). PG13
>> >> >> allowed that for opclasses, so perhaps we should allow it here too.
>> >> >
>> >> >Yes, that is also in the plan. For doing this we are planning to add
>> >> >an extra column in the pg_attribute which will store the compression
>> >> >options for the current compression method. The original patch was
>> >> >creating an extra catalog pg_column_compression, therein it maintains
>> >> >the oid of the compression method as well as the compression options.
>> >> >The advantage of creating an extra catalog is that we can keep the
>> >> >compression options for the preserved compression methods also so that
>> >> >we can support the options which can be used for decompressing the
>> >> >data as well. Whereas if we want to avoid this extra catalog then we
>> >> >can not use that compression option for decompressing. But most of
>> >> >the options e.g. compression level are just for the compressing so it
>> >> >is enough to store for the current compression method only. What's
>> >> >your thoughts?
>> >> >
>> >>
>> >> Not sure. My assumption was we'd end up with a new catalog, but maybe
>> >> stashing it into pg_attribute is fine. I was really thinking about two
>> >> kinds of options - compression level, and some sort of column-level
>> >> dictionary. Compression level is not necessary for decompression, but
>> >> the dictionary ID would be needed. (I think the global dictionary was
>> >> one of the use cases, aimed at JSON compression.)
>> >
>> >Ok
>> >
>> >> But I don't think stashing it in pg_attribute means we couldn't use it
>> >> for decompression - we'd just need to keep an array of options, one for
>> >> each compression method.
>> >
>> >Yeah, we can do that.
>> >
>> >Keeping it in a separate new catalog might be
>> >> cleaner, and I'm not sure how large the configuration might be.
>> >
>> >Yeah in that case it will be better to store in a separate catalog,
>> >because sometimes if multiple attributes are using the same
>> >compression method with the same options then we can store the same
>> >oid in attcompression instead of duplicating the option field.
>> >
>>
>> I doubt deduplicating the options like this is (sharing options between
>> columns) is really worth it, as it means extra complexity e.g. during
>> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
>> catalogs, so why should we do it here?
>
>Yeah, valid point.
>
>>
>> Ultimately I think it's a question of how large we expect the options to
>> be, and how flexible it needs to be.
>>
>> For example, what happens if the user does this:
>>
>> ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
>> ALTER ... SET COMPRESSION pglz PRESERVE;
>> ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
>>
>> I believe it's enough to keep just the last value, but maybe I'm wrong
>> and we need to keep the whole history?
>
>Currently, the syntax is like ALTER ... SET COMPRESSION my_compression
>WITH (options1) PRESERVE (old_compression1, old_compression2..). But I
>think if the user just gives PRESERVE without a list then we should
>just preserve the latest one.
>
Hmmm. Not sure that's very convenient. I'd expect the most common use
case for PRESERVE to be "I want to change compression for new data,
without a rewrite". If PRESERVE by default preserves only the latest one,
that pretty much forces users to always list all methods. I suggest
interpreting it as "preserve everything" instead.
Another option would be to require either a list of methods, or some
keyword defining what to preserve. Like for example
... PRESERVE (m1, m2, ...)
... PRESERVE ALL
... PRESERVE LAST
Does that make sense?
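In the grammar/parse tree this could be carried as something like (node
and field names purely illustrative, not from the patch):

typedef enum PreserveMode
{
    PRESERVE_LIST,              /* ... PRESERVE (m1, m2, ...) */
    PRESERVE_ALL,               /* ... PRESERVE ALL */
    PRESERVE_LAST               /* ... PRESERVE LAST */
} PreserveMode;

typedef struct CompressionPreserve
{
    PreserveMode    mode;
    List           *methods;    /* method names; used only for PRESERVE_LIST */
} CompressionPreserve;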
>> The use case I'm thinking about is the column-level JSON compression,
>> where one of the options identifies the dictionary. OTOH I'm not sure
>> this is the right way to track this info - we need to know which options
>> were compressed with which options, i.e. it needs to be encoded in each
>> value directly. It'd also require changes to the PRESERVE handling
>> because it'd be necessary to identify which options to preserve ...
>>
>> So maybe this is either nonsense or something we don't want to support,
>> and we should only allow one option for each compression method.
>
>Yeah, it is a bit confusing to add the same compression method with
>different compression options, then in the preserve list, we will
>have to allow the option as well along with the compression method to
>know which compression method with what options we want to preserve.
>
>And also as you mentioned that in rows we need to know the option as
>well. I think for solving this anyways for the custom compression
>methods we will have to store the OID of the compression method in the
>toast header so we can provide an intermediate catalog which will
>create a new row for each combination of compression method + option
>and the toast header can store the OID of that row so that we know
>with which compression method + option it was compressed with.
>
I agree. After thinking about this a bit more, I think we should just
keep the last options for each compression method. If we need to allow
multiple options for some future compression method, we can improve
this, but until then it'd be over-engineering. Let's do the simplest
possible thing here.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-07 04:56:36 |
Message-ID: | CAFiTN-u4S0_ZmtPdt4dvuw7U3x_eiYn+iwwTc17hivp-cCyKog@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >> >>
> >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >> >> >
> >> >> >Thanks, Tomas for your feedback.
> >> >> >
> >> >> >> 9) attcompression ...
> >> >> >>
> >> >> >> The main issue I see is what the patch does with attcompression. Instead
> >> >> >> of just using it to store a the compression method, it's also used to
> >> >> >> store the preserved compression methods. And using NameData to store
> >> >> >> this seems wrong too - if we really want to store this info, the correct
> >> >> >> way is either using text[] or inventing charvector or similar.
> >> >> >
> >> >> >The reason for using the NameData is the get it in the fixed part of
> >> >> >the data structure.
> >> >> >
> >> >>
> >> >> Why do we need that? It's possible to have varlena fields with direct
> >> >> access (see pg_index.indkey for example).
> >> >
> >> >I see. While making it NameData I was thinking whether we have an
> >> >option to direct access the varlena. Thanks for pointing me there. I
> >> >will change this.
> >> >
> >> > Adding NameData just to make
> >> >> it fixed-length means we're always adding 64B even if we just need a
> >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> >> >> That seems a bit unnecessary, and might be an issue with many attributes
> >> >> (e.g. with many temp tables, etc.).
> >> >
> >> >You are right. Even I did not like to keep 64B for this, so I will change it.
> >> >
> >> >>
> >> >> >> But to me this seems very much like a misuse of attcompression to track
> >> >> >> dependencies on compression methods, necessary because we don't have a
> >> >> >> separate catalog listing compression methods. If we had that, I think we
> >> >> >> could simply add dependencies between attributes and that catalog.
> >> >> >
> >> >> >Basically, up to this patch, we are having only built-in compression
> >> >> >methods and those can not be dropped so we don't need any dependency
> >> >> >at all. We just want to know what is the current compression method
> >> >> >and what is the preserve compression methods supported for this
> >> >> >attribute. Maybe we can do it better instead of using the NameData
> >> >> >but I don't think it makes sense to add a separate catalog?
> >> >> >
> >> >>
> >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> >> >> very much like a workaround needed because we don't have the catalog.
> >> >>
> >> >> I don't quite understand how could we support custom compression methods
> >> >> without listing them in some sort of catalog?
> >> >
> >> >Yeah for supporting custom compression we need some catalog.
> >> >
> >> >> >> Moreover, having the catalog would allow adding compression methods
> >> >> >> (from extensions etc) instead of just having a list of hard-coded
> >> >> >> compression methods. Which seems like a strange limitation, considering
> >> >> >> this thread is called "custom compression methods".
> >> >> >
> >> >> >I think I forgot to mention while submitting the previous patch that
> >> >> >the next patch I am planning to submit is, Support creating the custom
> >> >> >compression methods wherein we can use pg_am catalog to insert the new
> >> >> >compression method. And for dependency handling, we can create an
> >> >> >attribute dependency on the pg_am row. Basically, we will create the
> >> >> >attribute dependency on the current compression method AM as well as
> >> >> >on the preserved compression methods AM. As part of this, we will
> >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> >> >> >will be converted to the oid_vector (first OID will be of the current
> >> >> >compression method, followed by the preserved compression method's
> >> >> >oids).
> >> >> >
> >> >>
> >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> >> >> quite match what I though AMs are about, but maybe it's just my fault.
> >> >>
> >> >> FWIW it seems a bit strange to first do the attcompression magic and
> >> >> then add the catalog later - I think we should start with the catalog
> >> >> right away. The advantage is that if we end up committing only some of
> >> >> the patches in this cycle, we already have all the infrastructure etc.
> >> >> We can reorder that later, though.
> >> >
> >> >Hmm, yeah we can do this way as well that first create a new catalog
> >> >table and add entries for these two built-in methods and the
> >> >attcompression can store the oid vector. But if we only commit the
> >> >build-in compression methods part then does it make sense to create an
> >> >extra catalog or adding these build-in methods to the existing catalog
> >> >(if we plan to use pg_am). Then in attcompression instead of using
> >> >one byte for each preserve compression method, we need to use oid. So
> >> >from Robert's mail[1], it appeared to me that he wants that the
> >> >build-in compression methods part should be independently committable
> >> >and if we think from that perspective then adding a catalog doesn't
> >> >make much sense. But if we are planning to commit the custom method
> >> >also then it makes more sense to directly start with the catalog
> >> >because that way it will be easy to expand without much refactoring.
> >> >
> >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> >> >
> >>
> >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> >> interpreted in various ways - it does not really say whether the initial
> >> list of built-in methods should be in some C array, or already in a proper
> >> catalog.
> >>
> >> All I'm saying is it seems a bit weird to first implement dependencies
> >> based on strange (mis)use of attcompression attribute, and then replace
> >> it with a proper catalog. My understanding is those patches are expected
> >> to be committable one by one, but the attcompression approach seems a
> >> bit too hacky to me - not sure I'd want to commit that ...
> >
> >Okay, I will change this. So I will make create a new catalog
> >pg_compression and add the entry for two built-in compression methods
> >from the very first patch.
> >
>
> OK.
>
> >> >> >> 10) compression parameters?
> >> >> >>
> >> >> >> I wonder if we could/should allow parameters, like compression level
> >> >> >> (and maybe other stuff, depending on the compression method). PG13
> >> >> >> allowed that for opclasses, so perhaps we should allow it here too.
> >> >> >
> >> >> >Yes, that is also in the plan. For doing this we are planning to add
> >> >> >an extra column in the pg_attribute which will store the compression
> >> >> >options for the current compression method. The original patch was
> >> >> >creating an extra catalog pg_column_compression, therein it maintains
> >> >> >the oid of the compression method as well as the compression options.
> >> >> >The advantage of creating an extra catalog is that we can keep the
> >> >> >compression options for the preserved compression methods also so that
> >> >> >we can support the options which can be used for decompressing the
> >> >> >data as well. Whereas if we want to avoid this extra catalog then we
> >> >> >can not use that compression option for decompressing. But most of
> >> >> >the options e.g. compression level are just for the compressing so it
> >> >> >is enough to store for the current compression method only. What's
> >> >> >your thoughts?
> >> >> >
> >> >>
> >> >> Not sure. My assumption was we'd end up with a new catalog, but maybe
> >> >> stashing it into pg_attribute is fine. I was really thinking about two
> >> >> kinds of options - compression level, and some sort of column-level
> >> >> dictionary. Compression level is not necessary for decompression, but
> >> >> the dictionary ID would be needed. (I think the global dictionary was
> >> >> one of the use cases, aimed at JSON compression.)
> >> >
> >> >Ok
> >> >
> >> >> But I don't think stashing it in pg_attribute means we couldn't use it
> >> >> for decompression - we'd just need to keep an array of options, one for
> >> >> each compression method.
> >> >
> >> >Yeah, we can do that.
> >> >
> >> >Keeping it in a separate new catalog might be
> >> >> cleaner, and I'm not sure how large the configuration might be.
> >> >
> >> >Yeah in that case it will be better to store in a separate catalog,
> >> >because sometimes if multiple attributes are using the same
> >> >compression method with the same options then we can store the same
> >> >oid in attcompression instead of duplicating the option field.
> >> >
> >>
> >> I doubt deduplicating the options like this is (sharing options between
> >> columns) is really worth it, as it means extra complexity e.g. during
> >> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
> >> catalogs, so why should we do it here?
> >
> >Yeah, valid point.
> >
> >>
> >> Ultimately I think it's a question of how large we expect the options to
> >> be, and how flexible it needs to be.
> >>
> >> For example, what happens if the user does this:
> >>
> >> ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
> >> ALTER ... SET COMPRESSION pglz PRESERVE;
> >> ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
> >>
> >> I believe it's enough to keep just the last value, but maybe I'm wrong
> >> and we need to keep the whole history?
> >
> >Currently, the syntax is like ALTER ... SET COMPRESSION my_compression
> >WITH (options1) PRESERVE (old_compression1, old_compression2..). But I
> >think if the user just gives PRESERVE without a list then we should
> >just preserve the latest one.
> >
>
> Hmmm. Not sure that's very convenient. I'd expect the most common use
> case for PRESERVE being "I want to change compression for new data,
> without rewrite". If PRESERVE by default preserves the latest one, that
> pretty much forces users to always list all methods. I suggest
> iterpreting it as "preserve everything" instead.
>
> Another option would be to require either a list of methods, or some
> keyword defining what to preserve. Like for example
>
> ... PRESERVE (m1, m2, ...)
> ... PRESERVE ALL
> ... PRESERVE LAST
>
> Does that make sense?
Yeah, this makes sense to me.
>
> >> The use case I'm thinking about is the column-level JSON compression,
> >> where one of the options identifies the dictionary. OTOH I'm not sure
> >> this is the right way to track this info - we need to know which options
> >> were compressed with which options, i.e. it needs to be encoded in each
> >> value directly. It'd also require changes to the PRESERVE handling
> >> because it'd be necessary to identify which options to preserve ...
> >>
> >> So maybe this is either nonsense or something we don't want to support,
> >> and we should only allow one option for each compression method.
> >
> >Yeah, it is a bit confusing to add the same compression method with
> >different compression options, then in the preserve list, we will
> >have to allow the option as well along with the compression method to
> >know which compression method with what options we want to preserve.
> >
> >And also as you mentioned that in rows we need to know the option as
> >well. I think for solving this anyways for the custom compression
> >methods we will have to store the OID of the compression method in the
> >toast header so we can provide an intermediate catalog which will
> >create a new row for each combination of compression method + option
> >and the toast header can store the OID of that row so that we know
> >with which compression method + option it was compressed with.
> >
>
> I agree. After thinking about this a bit more, I think we should just
> keep the last options for each compression method. If we need to allow
> multiple options for some future compression method, we can improve
> this, but until then it'd be an over-engineering. Let's do the simplest
> possible thing here.
Okay.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-07 11:30:12 |
Message-ID: | CAFiTN-s6WPyos5O7Z6OKSh+gQSq2GdyoJJJwRidar9DeN5GzMw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >>
> > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >> >>
> > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > >> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >> >> >
> > >> >> >Thanks, Tomas for your feedback.
> > >> >> >
> > >> >> >> 9) attcompression ...
> > >> >> >>
> > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> > >> >> >> of just using it to store a the compression method, it's also used to
> > >> >> >> store the preserved compression methods. And using NameData to store
> > >> >> >> this seems wrong too - if we really want to store this info, the correct
> > >> >> >> way is either using text[] or inventing charvector or similar.
> > >> >> >
> > >> >> >The reason for using the NameData is the get it in the fixed part of
> > >> >> >the data structure.
> > >> >> >
> > >> >>
> > >> >> Why do we need that? It's possible to have varlena fields with direct
> > >> >> access (see pg_index.indkey for example).
> > >> >
> > >> >I see. While making it NameData I was thinking whether we have an
> > >> >option to direct access the varlena. Thanks for pointing me there. I
> > >> >will change this.
> > >> >
> > >> > Adding NameData just to make
> > >> >> it fixed-length means we're always adding 64B even if we just need a
> > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> > >> >> (e.g. with many temp tables, etc.).
> > >> >
> > >> >You are right. Even I did not like to keep 64B for this, so I will change it.
> > >> >
> > >> >>
> > >> >> >> But to me this seems very much like a misuse of attcompression to track
> > >> >> >> dependencies on compression methods, necessary because we don't have a
> > >> >> >> separate catalog listing compression methods. If we had that, I think we
> > >> >> >> could simply add dependencies between attributes and that catalog.
> > >> >> >
> > >> >> >Basically, up to this patch, we are having only built-in compression
> > >> >> >methods and those can not be dropped so we don't need any dependency
> > >> >> >at all. We just want to know what is the current compression method
> > >> >> >and what is the preserve compression methods supported for this
> > >> >> >attribute. Maybe we can do it better instead of using the NameData
> > >> >> >but I don't think it makes sense to add a separate catalog?
> > >> >> >
> > >> >>
> > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> > >> >> very much like a workaround needed because we don't have the catalog.
> > >> >>
> > >> >> I don't quite understand how could we support custom compression methods
> > >> >> without listing them in some sort of catalog?
> > >> >
> > >> >Yeah for supporting custom compression we need some catalog.
> > >> >
> > >> >> >> Moreover, having the catalog would allow adding compression methods
> > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > >> >> >> compression methods. Which seems like a strange limitation, considering
> > >> >> >> this thread is called "custom compression methods".
> > >> >> >
> > >> >> >I think I forgot to mention while submitting the previous patch that
> > >> >> >the next patch I am planning to submit is, Support creating the custom
> > >> >> >compression methods wherein we can use pg_am catalog to insert the new
> > >> >> >compression method. And for dependency handling, we can create an
> > >> >> >attribute dependency on the pg_am row. Basically, we will create the
> > >> >> >attribute dependency on the current compression method AM as well as
> > >> >> >on the preserved compression methods AM. As part of this, we will
> > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> > >> >> >will be converted to the oid_vector (first OID will be of the current
> > >> >> >compression method, followed by the preserved compression method's
> > >> >> >oids).
> > >> >> >
> > >> >>
> > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> > >> >>
> > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> > >> >> then add the catalog later - I think we should start with the catalog
> > >> >> right away. The advantage is that if we end up committing only some of
> > >> >> the patches in this cycle, we already have all the infrastructure etc.
> > >> >> We can reorder that later, though.
> > >> >
> > >> >Hmm, yeah we can do this way as well that first create a new catalog
> > >> >table and add entries for these two built-in methods and the
> > >> >attcompression can store the oid vector. But if we only commit the
> > >> >build-in compression methods part then does it make sense to create an
> > >> >extra catalog or adding these build-in methods to the existing catalog
> > >> >(if we plan to use pg_am). Then in attcompression instead of using
> > >> >one byte for each preserve compression method, we need to use oid. So
> > >> >from Robert's mail[1], it appeared to me that he wants that the
> > >> >build-in compression methods part should be independently committable
> > >> >and if we think from that perspective then adding a catalog doesn't
> > >> >make much sense. But if we are planning to commit the custom method
> > >> >also then it makes more sense to directly start with the catalog
> > >> >because that way it will be easy to expand without much refactoring.
> > >> >
> > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> > >> >
> > >>
> > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> > >> interpreted in various ways - it does not really say whether the initial
> > >> list of built-in methods should be in some C array, or already in a proper
> > >> catalog.
> > >>
> > >> All I'm saying is it seems a bit weird to first implement dependencies
> > >> based on strange (mis)use of attcompression attribute, and then replace
> > >> it with a proper catalog. My understanding is those patches are expected
> > >> to be committable one by one, but the attcompression approach seems a
> > >> bit too hacky to me - not sure I'd want to commit that ...
> > >
> > >Okay, I will change this. So I will make create a new catalog
> > >pg_compression and add the entry for two built-in compression methods
> > >from the very first patch.
> > >
> >
> > OK.
I have changed the first two patches: we now provide a new catalog,
pg_compression, and pg_attribute stores the OID of the compression
method. The patches still need some cleanup, and there is also one open
comment that an index should use its table's compression.
I am still working on the preserve patch. For preserving the compression
methods I am planning to convert the attcompression field to an oidvector
so that we can also store the OIDs of the preserved methods. However, I
am not sure whether we can access this oidvector as part of the fixed
part of FormData_pg_attribute. The reason is that for building the tuple
descriptor we need to give the size of the fixed part:

#define ATTRIBUTE_FIXED_PART_SIZE \
    (offsetof(FormData_pg_attribute, attcompression) + sizeof(Oid))

But if we convert this field to an oidvector, then we don't know the size
of the fixed part. Am I missing something?
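The underlying problem is that oidvector is a variable-length type. Its
definition in the PostgreSQL sources (src/include/c.h) is roughly:

typedef struct
{
    int32   vl_len_;        /* these fields must match ArrayType! */
    int     ndim;           /* number of dimensions, always 1 */
    int32   dataoffset;     /* always 0 (no null bitmap) */
    Oid     elemtype;
    int     dim1;           /* number of elements */
    int     lbound1;        /* always 1 */
    Oid     values[FLEXIBLE_ARRAY_MEMBER];
} oidvector;

Its size depends on dim1, so there is no compile-time constant that
covers the values[] array.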
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v4-0002-alter-table-set-compression.patch | application/octet-stream | 11.9 KB |
v4-0001-Built-in-compression-method.patch | application/octet-stream | 198.5 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-08 09:08:27 |
Message-ID: | CAFiTN-tYAf7AGpAC6Ofyq6oUCSK-k6Vdbr58go2ZQ3QY=T3=4A@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >
> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > > ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > >>
> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > > >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > >> >>
> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > > >> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > >> >> >
> > > >> >> >Thanks, Tomas for your feedback.
> > > >> >> >
> > > >> >> >> 9) attcompression ...
> > > >> >> >>
> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> > > >> >> >> of just using it to store a the compression method, it's also used to
> > > >> >> >> store the preserved compression methods. And using NameData to store
> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
> > > >> >> >> way is either using text[] or inventing charvector or similar.
> > > >> >> >
> > > >> >> >The reason for using the NameData is the get it in the fixed part of
> > > >> >> >the data structure.
> > > >> >> >
> > > >> >>
> > > >> >> Why do we need that? It's possible to have varlena fields with direct
> > > >> >> access (see pg_index.indkey for example).
> > > >> >
> > > >> >I see. While making it NameData I was thinking whether we have an
> > > >> >option to direct access the varlena. Thanks for pointing me there. I
> > > >> >will change this.
> > > >> >
> > > >> > Adding NameData just to make
> > > >> >> it fixed-length means we're always adding 64B even if we just need a
> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> > > >> >> (e.g. with many temp tables, etc.).
> > > >> >
> > > >> >You are right. Even I did not like to keep 64B for this, so I will change it.
> > > >> >
> > > >> >>
> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
> > > >> >> >> dependencies on compression methods, necessary because we don't have a
> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
> > > >> >> >> could simply add dependencies between attributes and that catalog.
> > > >> >> >
> > > >> >> >Basically, up to this patch, we are having only built-in compression
> > > >> >> >methods and those can not be dropped so we don't need any dependency
> > > >> >> >at all. We just want to know what is the current compression method
> > > >> >> >and what is the preserve compression methods supported for this
> > > >> >> >attribute. Maybe we can do it better instead of using the NameData
> > > >> >> >but I don't think it makes sense to add a separate catalog?
> > > >> >> >
> > > >> >>
> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> > > >> >> very much like a workaround needed because we don't have the catalog.
> > > >> >>
> > > >> >> I don't quite understand how could we support custom compression methods
> > > >> >> without listing them in some sort of catalog?
> > > >> >
> > > >> >Yeah for supporting custom compression we need some catalog.
> > > >> >
> > > >> >> >> Moreover, having the catalog would allow adding compression methods
> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > > >> >> >> compression methods. Which seems like a strange limitation, considering
> > > >> >> >> this thread is called "custom compression methods".
> > > >> >> >
> > > >> >> >I think I forgot to mention while submitting the previous patch that
> > > >> >> >the next patch I am planning to submit is, Support creating the custom
> > > >> >> >compression methods wherein we can use pg_am catalog to insert the new
> > > >> >> >compression method. And for dependency handling, we can create an
> > > >> >> >attribute dependency on the pg_am row. Basically, we will create the
> > > >> >> >attribute dependency on the current compression method AM as well as
> > > >> >> >on the preserved compression methods AM. As part of this, we will
> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> > > >> >> >will be converted to the oid_vector (first OID will be of the current
> > > >> >> >compression method, followed by the preserved compression method's
> > > >> >> >oids).
> > > >> >> >
> > > >> >>
> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> > > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> > > >> >>
> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> > > >> >> then add the catalog later - I think we should start with the catalog
> > > >> >> right away. The advantage is that if we end up committing only some of
> > > >> >> the patches in this cycle, we already have all the infrastructure etc.
> > > >> >> We can reorder that later, though.
> > > >> >
> > > >> >Hmm, yeah we can do this way as well that first create a new catalog
> > > >> >table and add entries for these two built-in methods and the
> > > >> >attcompression can store the oid vector. But if we only commit the
> > > >> >build-in compression methods part then does it make sense to create an
> > > >> >extra catalog or adding these build-in methods to the existing catalog
> > > >> >(if we plan to use pg_am). Then in attcompression instead of using
> > > >> >one byte for each preserve compression method, we need to use oid. So
> > > >> >from Robert's mail[1], it appeared to me that he wants that the
> > > >> >build-in compression methods part should be independently committable
> > > >> >and if we think from that perspective then adding a catalog doesn't
> > > >> >make much sense. But if we are planning to commit the custom method
> > > >> >also then it makes more sense to directly start with the catalog
> > > >> >because that way it will be easy to expand without much refactoring.
> > > >> >
> > > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> > > >> >
> > > >>
> > > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> > > >> interpreted in various ways - it does not really say whether the initial
> > > >> list of built-in methods should be in some C array, or already in a proper
> > > >> catalog.
> > > >>
> > > >> All I'm saying is it seems a bit weird to first implement dependencies
> > > >> based on strange (mis)use of attcompression attribute, and then replace
> > > >> it with a proper catalog. My understanding is those patches are expected
> > > >> to be committable one by one, but the attcompression approach seems a
> > > >> bit too hacky to me - not sure I'd want to commit that ...
> > > >
> > > >Okay, I will change this. So I will make create a new catalog
> > > >pg_compression and add the entry for two built-in compression methods
> > > >from the very first patch.
> > > >
> > >
> > > OK.
>
> I have changed the first 2 patches, basically, now we are providing a
> new catalog pg_compression and the pg_attribute is storing the oid of
> the compression method. The patches still need some cleanup and there
> is also one open comment that for index we should use its table
> compression.
>
> I am still working on the preserve patch. For preserving the
> compression method I am planning to convert the attcompression field
> to the oidvector so that we can store the oid of the preserve method
> also. I am not sure whether we can access this oidvector as a fixed
> part of the FormData_pg_attribute or not. The reason is that for
> building the tuple descriptor, we need to give the size of the fixed
> part (#define ATTRIBUTE_FIXED_PART_SIZE \
> (offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))). But
> if we convert this to the oidvector then we don't know the size of the
> fixed part. Am I missing something?
I can think of two solutions here:

Sol1: Make the first OID of the oidvector part of the fixed size, like below:

#define ATTRIBUTE_FIXED_PART_SIZE \
    (offsetof(FormData_pg_attribute, attcompression) + OidVectorSize(1))

Sol2: Keep attcompression as a plain Oid and, for the preserve list, add
another field of type oidvector in the variable part. I think most of the
time we only need to access the current compression method, and with this
solution we will be able to access that as part of the tuple descriptor.
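For context, OidVectorSize() is the existing helper from
src/backend/utils/adt/oid.c:

#define OidVectorSize(n)    (offsetof(oidvector, values) + (n) * sizeof(Oid))

so Sol1's fixed part would cover the oidvector header plus exactly one
Oid (the current method). A sketch of Sol2's layout in
FormData_pg_attribute (the new field name is hypothetical) would be:

    Oid         attcompression;     /* current method, in the fixed part */

#ifdef CATALOG_VARLEN               /* variable-length fields start here */
    oidvector   attcmpreserved;     /* preserved methods */
#endif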
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-08 14:24:42 |
Message-ID: | CAFiTN-vCwrAXXUJ5tpMyX0n7WymL2qDiSo7HbdoCCqmOFeibLg@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >
> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > > ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > >>
> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > > >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > >> >>
> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > > >> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > >> >> >
> > > >> >> >Thanks, Tomas for your feedback.
> > > >> >> >
> > > >> >> >> 9) attcompression ...
> > > >> >> >>
> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> > > >> >> >> of just using it to store a the compression method, it's also used to
> > > >> >> >> store the preserved compression methods. And using NameData to store
> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
> > > >> >> >> way is either using text[] or inventing charvector or similar.
> > > >> >> >
> > > >> >> >The reason for using the NameData is the get it in the fixed part of
> > > >> >> >the data structure.
> > > >> >> >
> > > >> >>
> > > >> >> Why do we need that? It's possible to have varlena fields with direct
> > > >> >> access (see pg_index.indkey for example).
> > > >> >
> > > >> >I see. While making it NameData I was thinking whether we have an
> > > >> >option to direct access the varlena. Thanks for pointing me there. I
> > > >> >will change this.
> > > >> >
> > > >> > Adding NameData just to make
> > > >> >> it fixed-length means we're always adding 64B even if we just need a
> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> > > >> >> (e.g. with many temp tables, etc.).
> > > >> >
> > > >> >You are right. Even I did not like to keep 64B for this, so I will change it.
> > > >> >
> > > >> >>
> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
> > > >> >> >> dependencies on compression methods, necessary because we don't have a
> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
> > > >> >> >> could simply add dependencies between attributes and that catalog.
> > > >> >> >
> > > >> >> >Basically, up to this patch, we are having only built-in compression
> > > ...
>
> I have changed the first 2 patches, basically, now we are providing a
> new catalog pg_compression and the pg_attribute is storing the oid of
> the compression method. The patches still need some cleanup and there
> is also one open comment that for index we should use its table
> compression.
There was some unwanted code in the previous patch, so I am attaching
the updated patches.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v5-0002-alter-table-set-compression.patch | application/octet-stream | 11.9 KB |
v5-0001-Built-in-compression-method.patch | application/octet-stream | 198.1 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-08 21:54:08 |
Message-ID: | 20201008215408.bvrdmt7z2qug7lan@development |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote:
>> ...
>>
>> I am still working on the preserve patch. For preserving the
>> compression method I am planning to convert the attcompression field
>> to the oidvector so that we can store the oid of the preserve method
>> also. I am not sure whether we can access this oidvector as a fixed
>> part of the FormData_pg_attribute or not. The reason is that for
>> building the tuple descriptor, we need to give the size of the fixed
>> part (#define ATTRIBUTE_FIXED_PART_SIZE \
>> (offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))). But
>> if we convert this to the oidvector then we don't know the size of the
>> fixed part. Am I missing something?
>
>I could think of two solutions here:
>Sol1:
>Make the first oid of the oidvector part of the fixed size, like below:
>#define ATTRIBUTE_FIXED_PART_SIZE \
>(offsetof(FormData_pg_attribute, attcompression) + OidVectorSize(1))
>
>Sol2:
>Keep attcompression as oid only and, for the preserve list, add
>another field in the variable part which will be of type oidvector. I
>think most of the time we need to access the current compression
>method, and with this solution we will be able to access that as part
>of the tuple desc.
>
And is the oidvector actually needed? If we have the extra catalog,
can't we track this simply using the regular dependencies? So we'd have
the attcompression OID of the current compression method, and the
preserved values would be tracked in pg_depend.
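Roughly what I have in mind, as a sketch only: record each preserved
method as a normal dependency of the column. CompressionRelationId
below stands in for the OID macro of the new pg_compression catalog
(the name is made up here); the dependency API itself is the stock one.

#include "postgres.h"

#include "access/attnum.h"
#include "catalog/dependency.h"
#include "catalog/objectaddress.h"
#include "catalog/pg_class.h"

/*
 * Record one preserved compression method for a column as a plain
 * pg_depend entry: the depender is (pg_class, relid, attnum) and the
 * referenced object is the method's row in pg_compression.
 */
static void
record_preserved_compression(Oid relid, AttrNumber attnum, Oid cmoid)
{
    ObjectAddress attr;
    ObjectAddress method;

    ObjectAddressSubSet(attr, RelationRelationId, relid, attnum);
    ObjectAddressSet(method, CompressionRelationId, cmoid);

    recordDependencyOn(&attr, &method, DEPENDENCY_NORMAL);
}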
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-09 09:31:48 |
Message-ID: | CAFiTN-s47Le7oeKA8m=Ai2FN0MwokCHMF_jTE0KcJu3w97O91g@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Oct 9, 2020 at 3:24 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> ...
>
> And is the oidvector actually needed? If we have the extra catalog,
> can't we track this simply using the regular dependencies? So we'd have
> the attcompression OID of the current compression method, and the
> preserved values would be tracked in pg_depend.
Right, we can do that as well. Actually, the preserved list needs to
be accessed only in the case of ALTER TABLE SET COMPRESSION and INSERT
INTO SELECT * FROM queries. So in such cases, I think it is okay to
get the preserved compression oids from pg_depend.
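For that lookup I am thinking of something roughly like the following
(only a sketch; CompressionRelationId again stands in for the OID macro
of the new pg_compression catalog). Filtering on refclassid also keeps
these entries apart from any other dependency the column may have:

#include "postgres.h"

#include "access/genam.h"
#include "access/htup_details.h"
#include "access/table.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "catalog/pg_depend.h"
#include "nodes/pg_list.h"
#include "utils/fmgroids.h"
#include "utils/rel.h"

/* Collect the preserved compression method OIDs for one attribute. */
static List *
get_preserved_cmoids(Oid relid, AttrNumber attnum)
{
    List       *result = NIL;
    Relation    depRel;
    ScanKeyData key[3];
    SysScanDesc scan;
    HeapTuple   tup;

    depRel = table_open(DependRelationId, AccessShareLock);

    ScanKeyInit(&key[0], Anum_pg_depend_classid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(RelationRelationId));
    ScanKeyInit(&key[1], Anum_pg_depend_objid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(relid));
    ScanKeyInit(&key[2], Anum_pg_depend_objsubid,
                BTEqualStrategyNumber, F_INT4EQ,
                Int32GetDatum((int32) attnum));

    scan = systable_beginscan(depRel, DependDependerIndexId, true,
                              NULL, 3, key);

    while (HeapTupleIsValid(tup = systable_getnext(scan)))
    {
        Form_pg_depend dep = (Form_pg_depend) GETSTRUCT(tup);

        /* keep only dependencies that point into pg_compression */
        if (dep->refclassid == CompressionRelationId)
            result = lappend_oid(result, dep->refobjid);
    }

    systable_endscan(scan);
    table_close(depRel, AccessShareLock);

    return result;
}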
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-12 08:58:43 |
Message-ID: | CAFiTN-t6pS3vnzXr7SdvHQP_brYLWCt3-0avabP2y2wGmT1Kkw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Oct 9, 2020 at 3:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> > ...
> >
> > And is the oidvector actually needed? If we have the extra catalog,
> > can't we track this simply using the regular dependencies? So we'd have
> > the attcompression OID of the current compression method, and the
> > preserved values would be tracked in pg_depend.
>
> Right, we can do that as well. Actually, the preserved list needs to
> be accessed only in the case of ALTER TABLE SET COMPRESSION and INSERT
> INTO SELECT * FROM queries. So in such cases, I think it is okay to
> get the preserved compression oids from pg_depend.
I have worked on this patch; as discussed, I am now maintaining the
preserved compression methods using dependencies. The PRESERVE ALL
syntax is still not supported; I will work on that part.
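For reference, the syntax under discussion is something like ALTER
TABLE t ALTER COLUMN a SET COMPRESSION <method> PRESERVE (<methods>),
with PRESERVE ALL meant to keep every method already used by the
column (the exact spelling may still change).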
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v6-0001-Built-in-compression-method.patch | application/octet-stream | 198.1 KB |
v6-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 39.7 KB |
v6-0002-alter-table-set-compression.patch | application/octet-stream | 11.9 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-12 14:02:38 |
Message-ID: | 20201012140238.lb3hoxfwvmmmlaix@development |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
>
>> ...
>
>I have worked on this patch; as discussed, I am now maintaining the
>preserved compression methods using dependencies. The PRESERVE ALL
>syntax is still not supported; I will work on that part.
>
Cool, I'll take a look. What's your opinion on doing it this way? Do you
think it's cleaner / more elegant, or is it something contrary to what
the dependencies are meant to do?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-13 05:00:42 |
Message-ID: | CAFiTN-vCwdFRkDFGUnEsSh9yAnWc4mr-kieUiZ9LTLh3cwPQrQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> > ...
>
> Cool, I'll take a look. What's your opinion on doing it this way? Do you
> think it's cleaner / more elegant, or is it something contrary to what
> the dependencies are meant to do?
I think this looks much cleaner. Moreover, I feel that once we start
supporting custom compression methods we will have to maintain the
dependencies anyway, so using them for finding the preserved
compression methods is a good option.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-17 06:04:05 |
Message-ID: | CAFiTN-vGo+b-ggRVChZqE54G8445eo+JrMiaKX5EXTVYeumRPg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> > ...
> >
> > Cool, I'll take a look. What's your opinion on doing it this way? Do you
> > think it's cleaner / more elegant, or is it something contrary to what
> > the dependencies are meant to do?
>
> I think this looks much cleaner. Moreover, I feel that once we start
> supporting custom compression methods we will have to maintain the
> dependencies anyway, so using them for finding the preserved
> compression methods is a good option.
I have also implemented the next set of patches.
0004 -> Provide a way to create custom compression methods
0005 -> Extension to implement lz4 as a custom compression method.
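To give an idea of the shape of 0004/0005: the extension exposes a
handler function that just returns a table of callbacks, roughly like
the sketch below. CompressionRoutine and the cmcompress/cmdecompress
field names follow the WIP patches and may still change, so treat this
as illustrative only.

#include "postgres.h"

#include "fmgr.h"
#include "nodes/nodes.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(lz4_compression_handler);

static struct varlena *
lz4_cmcompress(const struct varlena *value)
{
    /* the real extension calls LZ4_compress_default() here */
    elog(ERROR, "not implemented in this sketch");
    return NULL;        /* keep the compiler quiet */
}

static struct varlena *
lz4_cmdecompress(const struct varlena *value)
{
    /* the real extension calls LZ4_decompress_safe() here */
    elog(ERROR, "not implemented in this sketch");
    return NULL;        /* keep the compiler quiet */
}

/*
 * The core code looks this handler up through the new pg_compression
 * catalog and calls it to obtain the routines for a column.
 */
Datum
lz4_compression_handler(PG_FUNCTION_ARGS)
{
    CompressionRoutine *routine = makeNode(CompressionRoutine);

    routine->cmcompress = lz4_cmcompress;
    routine->cmdecompress = lz4_cmdecompress;

    PG_RETURN_POINTER(routine);
}

The SQL side is then just a CREATE COMPRESSION METHOD ... HANDLER
command pointing at this function.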
A pending list of items:
1. Provide support for handling the compression option
- As discussed upthread, I will store the compression option of the
latest compression method in a new field in the pg_attribute table.
2. As of now I have kept zlib as the second built-in option and lz4 as
a custom compression extension. In an off-list discussion, Robert
suggested that we should keep lz4 as the built-in method and move zlib
to an extension, since lz4 is faster than zlib. So in the next version
I will change that. Any different opinion on this?
3. Improve the documentation, especially for create_compression_method.
4. By default, support the table's compression method for indexes.
5. Support the PRESERVE ALL option so that we can preserve the whole
existing list of compression methods without spelling it out.
6. Cleanup of 0004 and 0005, as they are still WIP.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v7-0002-alter-table-set-compression.patch | application/octet-stream | 11.9 KB |
v7-0001-Built-in-compression-method.patch | application/octet-stream | 198.2 KB |
v7-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 39.7 KB |
v7-0005-new-compression-method-extension-for-lz4_WIP.patch | application/octet-stream | 9.5 KB |
v7-0004-Create-custom-compression-methods_WIP.patch | application/octet-stream | 37.7 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-21 08:29:50 |
Message-ID: | CAFiTN-s+ieD_wyZZ8O3SGq_NSQ_0Ks1TXS29WTsfj4+03VDrAg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Oct 17, 2020 at 11:34 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> > ...
>
> I have also implemented the next set of patches.
> 0004 -> Provide a way to create custom compression methods
> 0005 -> Extension to implement lz4 as a custom compression method.
In the updated version I have worked on some of the listed items:
> A pending list of items:
> 1. Provide support for handling the compression option
> - As discussed upthread, I will store the compression option of the
> latest compression method in a new field in the pg_attribute table.
> 2. As of now I have kept zlib as the second built-in option and lz4 as
> a custom compression extension. In an off-list discussion, Robert
> suggested that we should keep lz4 as the built-in method and move zlib
> to an extension, since lz4 is faster than zlib. So in the next version
> I will change that. Any different opinion on this?
Done
> 3. Improve the documentation, especially for create_compression_method.
> 4. By default, support the table's compression method for indexes.
Done
> 5. Support the PRESERVE ALL option so that we can preserve the whole
> existing list of compression methods without spelling it out.
Points 1, 3, and 5 are still pending.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v8-0002-alter-table-set-compression.patch | application/octet-stream | 12.9 KB |
v8-0001-Built-in-compression-method.patch | application/octet-stream | 201.2 KB |
v8-0004-Create-custom-compression-methods.patch | application/octet-stream | 39.2 KB |
v8-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 39.8 KB |
v8-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.7 KB |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-21 15:20:59 |
Message-ID: | CA+TgmoayD1ajxJoW12e+R0a81-9fV5AR2+6V8b3ba1pmv5901w@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 8, 2020 at 5:54 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> And is the oidvector actually needed? If we have the extra catalog,
> can't we track this simply using the regular dependencies? So we'd have
> the attcompression OID of the current compression method, and the
> preserved values would be tracked in pg_depend.
If we go that route, we have to be sure that no such dependencies can
exist for any other reason. Otherwise, there would be confusion about
whether the dependency was there because values of that type were
being preserved in the table, or whether it was for the hypothetical
other reason. Now, admittedly, I can't quite think how that would
happen. For example, if the attribute default expression somehow
embedded a reference to a compression AM, that wouldn't cause this
problem, because the dependency would be on the attribute default
rather than the attribute itself. So maybe it's fine.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-21 20:41:30 |
Message-ID: | 20201021204130.rv6jk664ndijvrgb@development |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Oct 21, 2020 at 01:59:50PM +0530, Dilip Kumar wrote:
>> ...
>
>In the updated version I have worked on some of the listed items:
>> ...
>
>Points 1, 3, and 5 are still pending.
>
Thanks. I took a quick look at the patches and I think it seems fine. I
have one question, though - toast_compress_datum contains this code:
/* Call the actual compression function */
tmp = cmroutine->cmcompress((const struct varlena *) value);
if (!tmp)
return PointerGetDatum(NULL);
Shouldn't this really throw an error instead? I mean, if the compression
library returns NULL, isn't that an error?
regards
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-22 05:11:38 |
Message-ID: | CAFiTN-uWQDccBMZXNppd0qZnUbwMwcfNQJxFJg6nWXYAF4W3-w@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 22, 2020 at 2:11 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> > ...
>
> Thanks. I took a quick look at the patches and I think it seems fine. I
> have one question, though - toast_compress_datum contains this code:
>
>
> /* Call the actual compression function */
> tmp = cmroutine->cmcompress((const struct varlena *) value);
> if (!tmp)
> return PointerGetDatum(NULL);
>
>
> Shouldn't this really throw an error instead? I mean, if the compression
> library returns NULL, isn't that an error?
I don't think we can throw an error here, because pglz_compress might
return -1 if it finds that it cannot reduce the size of the data; we
treat such data as "incompressible" and return NULL. In such a case the
caller will try to compress another attribute of the tuple. I think we
can handle such cases in the specific handler functions.
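To make that concrete, the pglz callback in the patch is shaped roughly
like the sketch below (simplified; the header bookkeeping, e.g.
recording the raw size, is abbreviated):

#include "postgres.h"

#include "access/toast_internals.h"
#include "common/pg_lzcompress.h"

static struct varlena *
pglz_cmcompress(const struct varlena *value)
{
    int32       valsize = VARSIZE_ANY_EXHDR(value);
    int32       len;
    struct varlena *tmp;

    /* worst-case output size, plus the compressed-varlena header */
    tmp = (struct varlena *) palloc(PGLZ_MAX_OUTPUT(valsize) +
                                    TOAST_COMPRESS_HDRSZ);

    len = pglz_compress(VARDATA_ANY(value), valsize,
                        (char *) tmp + TOAST_COMPRESS_HDRSZ,
                        PGLZ_strategy_default);
    if (len < 0)
    {
        /* incompressible data: return NULL, the caller stores it as-is */
        pfree(tmp);
        return NULL;
    }

    SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ);
    return tmp;
}

So returning NULL is the handler's way of saying "store this value
uncompressed" rather than an error; a genuine library failure can still
elog(ERROR) inside the handler itself.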
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-22 11:40:27 |
Message-ID: | CAFiTN-v7ZFg2u0h+e+iBELafBBbAFOY9OKse1UxTS5HaSeB6Cw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Oct 21, 2020 at 8:51 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> > ...
>
> If we go that route, we have to be sure that no such dependencies can
> exist for any other reason. Otherwise, there would be confusion about
> whether the dependency was there because values of that type were
> being preserved in the table, or whether it was for the hypothetical
> other reason. Now, admittedly, I can't quite think how that would
> happen. For example, if the attribute default expression somehow
> embedded a reference to a compression AM, that wouldn't cause this
> problem, because the dependency would be on the attribute default
> rather than the attribute itself. So maybe it's fine.
Yeah, and moreover in the new patchset we are storing the compression
methods in the new catalog 'pg_compression' instead of merging them
into pg_am. So we will maintain the attribute -> pg_compression
dependency only for the preserve purpose, and it should be fine.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-22 12:26:55 |
Message-ID: | CAFiTN-tn_TLqxTuP8P0pQEt9Zth+DKQTxd-uMGCTpdZXDovKxg@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Oct 22, 2020 at 10:41 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Oct 22, 2020 at 2:11 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> > On Wed, Oct 21, 2020 at 01:59:50PM +0530, Dilip Kumar wrote:
> > >On Sat, Oct 17, 2020 at 11:34 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >>
> > >> On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >> >
> > >> > On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
> > >> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >> > >
> > >> > > On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> > >> > > >
> > >> > > >> ...
> > >> > > >
> > >> > > >I have worked on this patch, so as discussed now I am maintaining the
> > >> > > >preserved compression methods using dependency. Still PRESERVE ALL
> > >> > > >syntax is not supported, I will work on that part.
> > >> > > >
> > >> > >
> > >> > > Cool, I'll take a look. What's your opinion on doing it this way? Do you
> > >> > > think it's cleaner / more elegant, or is it something contrary to what
> > >> > > the dependencies are meant to do?
> > >> >
> > >> > I think this looks much cleaner. Moreover, I feel that once we start
> > >> > supporting the custom compression methods then we anyway have to
> > >> > maintain the dependency so using that for finding the preserved
> > >> > compression method is good option.
> > >>
> > >> I have also implemented the next set of patches.
> > >> 0004 -> Provide a way to create custom compression methods
> > >> 0005 -> Extension to implement lz4 as a custom compression method.
> > >
> > >In the updated version I have worked on some of the listed items
> > >> A pending list of items:
> > >> 1. Provide support for handling the compression option
> > >> - As discussed up thread I will store the compression option of the
> > >> latest compression method in a new field in pg_attribute table
> > >> 2. As of now I have kept zlib as the second built-in option and lz4 as
> > >> a custom compression extension. In Offlist discussion with Robert, he
> > >> suggested that we should keep lz4 as the built-in method and we can
> > >> move zlib as an extension because lz4 is faster than zlib so better to
> > >> keep that as the built-in method. So in the next version, I will
> > >> change that. Any different opinion on this?
> > >
> > >Done
> > >
> > >> 3. Improve the documentation, especially for create_compression_method.
> > >> 4. By default support table compression method for the index.
> > >
> > >Done
> > >
> > >> 5. Support the PRESERVE ALL option so that we can preserve all
> > >> existing lists of compression methods without providing the whole
> > >> list.
> > >
> > >1,3,5 points are still pending.
> > >
> >
> > Thanks. I took a quick look at the patches and I think it seems fine. I
> > have one question, though - toast_compress_datum contains this code:
> >
> >
> > /* Call the actual compression function */
> > tmp = cmroutine->cmcompress((const struct varlena *) value);
> > if (!tmp)
> > return PointerGetDatum(NULL);
> >
> >
> > Shouldn't this really throw an error instead? I mean, if the compression
> > library returns NULL, isn't that an error?
>
> I don't think we can throw an error here, because pglz_compress
> might return -1 if it finds that it cannot reduce the size of the
> data; we treat such data as "incompressible" and return NULL. In
> that case the caller will try to compress another attribute of the
> tuple. I think we can handle such cases in the specific handler
> functions.
I have added a compression-failure error in lz4.c; please refer to
lz4_cmcompress in the v9-0001 patch. Apart from that, I have also
added support for the PRESERVE ALL syntax, to preserve all the
existing compression methods. I have also rebased the patch on the
current head.
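To show the distinction discussed above between "incompressible"
(return NULL) and a genuine failure (raise an error), here is a rough
sketch of what such an lz4_cmcompress could look like. It is written
against liblz4's public API (LZ4_compress_default() returns 0 on
failure) and is not necessarily the v9-0001 code:
#include <lz4.h>
static struct varlena *
lz4_cmcompress(const struct varlena *value)
{
	int32		valsize = VARSIZE_ANY_EXHDR(value);
	int32		max_size = LZ4_compressBound(valsize);
	int32		len;
	struct varlena *tmp;
	tmp = (struct varlena *) palloc(max_size + TOAST_COMPRESS_HDRSZ);
	len = LZ4_compress_default(VARDATA_ANY(value),
							   TOAST_COMPRESS_RAWDATA(tmp),
							   valsize, max_size);
	if (len <= 0)
		elog(ERROR, "lz4 compression failed");	/* real failure */
	/* lz4 always "succeeds"; the caller still checks whether the
	 * compressed result is actually smaller than the input */
	SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ);
	return tmp;
}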
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v9-0002-alter-table-set-compression.patch | application/octet-stream | 12.8 KB |
v9-0001-Built-in-compression-method.patch | application/octet-stream | 201.1 KB |
v9-0004-Create-custom-compression-methods.patch | application/octet-stream | 39.3 KB |
v9-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 40.3 KB |
v9-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.7 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-27 05:24:32 |
Message-ID: | CAFiTN-tZc-Dq5FyyzU_tt+GriTVW1ah3rg6-7_WpCjLf1QgG5Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Oct 22, 2020 at 5:56 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> ...
>
> I have added a compression-failure error in lz4.c; please refer to
> lz4_cmcompress in the v9-0001 patch. Apart from that, I have also
> added support for the PRESERVE ALL syntax, to preserve all the
> existing compression methods. I have also rebased the patch on the
> current head.
I have added the next patch, which supports compression options. I am
storing the compression options only for the latest compression
method. Basically, with this design we can only support options that
are needed at compression time (decompression cannot depend on them,
since we keep options only for the current method). As of now, the
compression option infrastructure is in place, and compression
options are added for the built-in method pglz and the external
method zlib. Next, I will work on adding the options for the lz4
method.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v10-0002-alter-table-set-compression.patch | application/octet-stream | 12.8 KB |
v10-0001-Built-in-compression-method.patch | application/octet-stream | 200.8 KB |
v10-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.8 KB |
v10-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 40.4 KB |
v10-0004-Create-custom-compression-methods.patch | application/octet-stream | 39.3 KB |
v10-0006-Support-compression-methods-options.patch | application/octet-stream | 41.5 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-28 07:46:31 |
Message-ID: | CAFiTN-sySfmVMiiNEGo8QOirmzXw3P4n8+DUXnt85=eNNfpD4w@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Oct 27, 2020 at 10:54 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> ...
>
> I have added the next patch, which supports compression options. I am
> storing the compression options only for the latest compression
> method. Basically, with this design we can only support options that
> are needed at compression time (decompression cannot depend on them,
> since we keep options only for the current method). As of now, the
> compression option infrastructure is in place, and compression
> options are added for the built-in method pglz and the external
> method zlib. Next, I will work on adding the options for the lz4
> method.
In the attached patch set I have also included compression option
support for lz4. As of now, I have only supported the acceleration
option for LZ4_compress_fast. There is also support for
dictionary-based compression, but if we were to support that we
would need the dictionary for decompression as well. Since we only
keep the options for the current compression method, we cannot
support dictionary-based options as of now.
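For reference, liblz4's fast entry point is LZ4_compress_fast(src,
dst, srcSize, dstCapacity, acceleration), where larger acceleration
values trade compression ratio for speed (1 behaves like
LZ4_compress_default). A sketch of the call, assuming the
acceleration value has already been parsed out of the attribute's
stored compression options, and with valsize, max_size and tmp set up
as in the plain lz4 sketch earlier in the thread:
	int		acceleration = 1;	/* hypothetical: parsed from the
								 * attribute's compression options */
	len = LZ4_compress_fast(VARDATA_ANY(value),
							TOAST_COMPRESS_RAWDATA(tmp),
							valsize, max_size, acceleration);
	if (len <= 0)
		elog(ERROR, "lz4 compression failed");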
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v11-0004-Create-custom-compression-methods.patch | application/octet-stream | 39.3 KB |
v11-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.8 KB |
v11-0001-Built-in-compression-method.patch | application/octet-stream | 200.8 KB |
v11-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 40.4 KB |
v11-0002-alter-table-set-compression.patch | application/octet-stream | 12.8 KB |
v11-0006-Support-compression-methods-options.patch | application/octet-stream | 43.8 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-28 19:01:21 |
Message-ID: | 20201028190121.kj72ftoxyub4wsed@development |
Lists: | pgsql-hackers |
On Wed, Oct 28, 2020 at 01:16:31PM +0530, Dilip Kumar wrote:
>> ...
>
>In the attached patch set I have also included compression option
>support for lz4. As of now, I have only supported the acceleration
>option for LZ4_compress_fast. There is also support for
>dictionary-based compression, but if we were to support that we
>would need the dictionary for decompression as well. Since we only
>keep the options for the current compression method, we cannot
>support dictionary-based options as of now.
>
OK, thanks. Do you have any other plans to improve this patch series? I
plan to do some testing and review, but if you're likely to post another
version soon then I'd wait a bit.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-10-29 06:37:16 |
Message-ID: | CAFiTN-tCmwGa_Bp-eR+_LHNa5MjGs375z1x4vy4VQXDWg3YBKA@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Oct 29, 2020 at 12:31 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> ...
>
> OK, thanks. Do you have any other plans to improve this patch series? I
> plan to do some testing and review, but if you're likely to post another
> version soon then I'd wait a bit.
There was an issue in create_compression_method.sgml, and
drop_compression_method.sgml was missing. I have fixed both in the
attached patch. I am not planning to change anything soon, so you
can review. Thanks.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v12-0002-alter-table-set-compression.patch | application/octet-stream | 12.8 KB |
v12-0001-Built-in-compression-method.patch | application/octet-stream | 200.8 KB |
v12-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.8 KB |
v12-0004-Create-custom-compression-methods.patch | application/octet-stream | 44.7 KB |
v12-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 40.4 KB |
v12-0006-Support-compression-methods-options.patch | application/octet-stream | 43.8 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-08 10:59:45 |
Message-ID: | CAFiTN-uhn=OdxR6dTZa_zNtUwWG3RCobStyeva0EeEvZsmN8ug@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Oct 29, 2020 at 12:07 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Oct 29, 2020 at 12:31 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> > ...
> >
> > OK, thanks. Do you have any other plans to improve this patch series? I
> > plan to do some testing and review, but if you're likely to post another
> > version soon then I'd wait a bit.
>
> There was an issue in create_compression_method.sgml, and
> drop_compression_method.sgml was missing. I have fixed both in the
> attached patch. I am not planning to change anything soon, so you
> can review. Thanks.
The patches were not applying on the current head, so I have rebased them.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v13-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 40.6 KB |
v13-0001-Built-in-compression-method.patch | application/octet-stream | 200.8 KB |
v13-0002-alter-table-set-compression.patch | application/octet-stream | 12.9 KB |
v13-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.8 KB |
v13-0004-Create-custom-compression-methods.patch | application/octet-stream | 44.8 KB |
v13-0006-Support-compression-methods-options.patch | application/octet-stream | 44.4 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-11 14:39:37 |
Message-ID: | CAFiTN-vwVHdGi36qoF_ZRW5ehD+ea642XH=+EBk50Qc4V1h1TA@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Nov 8, 2020 at 4:29 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> ...
>
> The patches were not applying on the current head, so I have rebased them.
There were a few problems in this rebased version; basically, the
compression options were not passed while compressing values from
brin_form_tuple, so I have fixed this.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v14-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.8 KB |
v14-0002-alter-table-set-compression.patch | application/octet-stream | 12.9 KB |
v14-0001-Built-in-compression-method.patch | application/octet-stream | 200.9 KB |
v14-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 40.6 KB |
v14-0004-Create-custom-compression-methods.patch | application/octet-stream | 44.8 KB |
v14-0006-Support-compression-methods-options.patch | application/octet-stream | 49.7 KB |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-20 22:20:43 |
Message-ID: | CA+TgmoaKDW1Oi9V=jc9hOGyf77NbkNEABuqgHD1Cq==1QsOcxg@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Nov 11, 2020 at 9:39 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> There were a few problems in this rebased version; basically, the
> compression options were not passed while compressing values from
> brin_form_tuple, so I have fixed this.
Since the authorship history of this patch is complicated, it would be
nice if you would include authorship information and relevant
"Discussion" links in the patches.
Design level considerations and overall notes:
configure is autogenerated from configure.in, so the patch shouldn't
include changes only to the former.
Looking over the changes to src/include:
+ PGLZ_COMPRESSION_ID,
+ LZ4_COMPRESSION_ID
I think that it would be good to assign values to these explicitly.
+/* compresion handler routines */
Spelling.
+ /* compression routine for the compression method */
+ cmcompress_function cmcompress;
+
+ /* decompression routine for the compression method */
+ cmcompress_function cmdecompress;
Don't reuse cmcompress_function; that's confusing. Just have a typedef
per structure member, even if they end up being the same.
#define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
- (((toast_compress_header *) (ptr))->rawsize = (len))
+do { \
+ Assert(len > 0 && len <= RAWSIZEMASK); \
+ ((toast_compress_header *) (ptr))->info = (len); \
+} while (0)
Indentation.
+#define TOAST_COMPRESS_SET_COMPRESSION_METHOD(ptr, cm_method) \
+ ((toast_compress_header *) (ptr))->info |= ((cm_method) << 30);
What about making TOAST_COMPRESS_SET_RAWSIZE() take another argument?
And possibly also rename it to TOAST_COMPRESS_SET_SIZE_AND_METHOD() or
something? It seems not great to have separate functions each setting
part of a 4-byte quantity. Too much chance of failing to set both
parts. I guess you've got a function called
toast_set_compressed_datum_info() for that, but it's just a wrapper
around two macros that could just be combined, which would reduce
complexity overall.
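For concreteness, the combined setter might look like the sketch
below; the name and exact assertions are only suggestions, reusing
the info field, RAWSIZEMASK, and the <<30 shift from the hunks
quoted above:
#define TOAST_COMPRESS_SET_SIZE_AND_METHOD(ptr, len, cm_method) \
	do { \
		Assert((len) > 0 && (len) <= RAWSIZEMASK); \
		((toast_compress_header *) (ptr))->info = \
			(uint32) (len) | ((uint32) (cm_method) << 30); \
	} while (0)
That way the size and the method bits can never be set independently,
which removes the failed-to-set-both-parts hazard.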
+ T_CompressionRoutine, /* in access/compressionapi.h */
This looks misplaced. I guess it should go just after these:
T_FdwRoutine, /* in foreign/fdwapi.h */
T_IndexAmRoutine, /* in access/amapi.h */
T_TableAmRoutine, /* in access/tableam.h */
Looking over the regression test changes:
The tests at the top of create_cm.out that just test that we can
create tables with various storage types seem unrelated to the purpose
of the patch. And the file doesn't test creating a compression method
either, as the file name would suggest, so either the file name needs
to be changed (compression, compression_method?) or the tests don't go
here.
+-- check data is okdd
I guess whoever is responsible for this comment prefers vi to emacs.
I don't quite understand the purpose of all of these tests, and there
are some things that I feel like ought to be tested that seemingly
aren't. Like, you seem to test using an UPDATE to move a datum from a
table to another table with the same compression method, but not one
with a different compression method. Testing the former is nice and
everything, but that's the easy case: I think we also need to test the
latter. I think it would be good to verify not only that the data is
readable but that it's compressed the way we expect. I think it would
be a great idea to add a pg_column_compression() function in a similar
spirit to pg_column_size(). Perhaps it could return NULL when
compression is not in use or the data type is not varlena, and the
name of the compression method otherwise. That would allow for better
testing of this feature, and it would also be useful to users who are
switching methods, to see what data they still have that's using the
old method. It could be useful for debugging problems on customer
systems, too.
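A very rough sketch of that idea, using the TOAST_COMPRESS_METHOD()
macro from the hunks quoted above; handling of externally stored
(TOASTed-out) values is omitted here and would need to inspect the
external pointer's header instead, and custom methods would need a
catalog lookup for the name:
Datum
pg_column_compression(PG_FUNCTION_ARGS)
{
	struct varlena *value = (struct varlena *) PG_GETARG_POINTER(0);
	const char *method;
	if (!VARATT_IS_COMPRESSED(value))
		PG_RETURN_NULL();		/* uncompressed, or not varlena */
	switch (TOAST_COMPRESS_METHOD(value))
	{
		case PGLZ_COMPRESSION_ID:
			method = "pglz";
			break;
		case LZ4_COMPRESSION_ID:
			method = "lz4";
			break;
		default:
			PG_RETURN_NULL();	/* custom method: catalog lookup needed */
	}
	PG_RETURN_TEXT_P(cstring_to_text(method));
}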
I wonder if we need a test that moves data between tables through an
intermediary. For instance, suppose a plpgsql function or DO block
fetches some data and stores it in a plpgsql variable and then uses
the variable to insert into another table. Hmm, maybe that would force
de-TOASTing. But perhaps there are other cases. Maybe a more general
way to approach the problem is: have you tried running a coverage
report and checked which parts of your code are getting exercised by
the existing tests and which parts are not? The stuff that isn't, we
should try to add more tests. It's easy to get corner cases wrong with
this kind of thing.
I notice that LIKE INCLUDING COMPRESSION doesn't seem to be tested, at
least not by 0001, which reinforces my feeling that the tests here are
not as thorough as they could be.
+NOTICE: pg_compression contains unpinned initdb-created object(s)
This seems wrong to me - why is it OK?
- result = (struct varlena *)
- palloc(TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
- SET_VARSIZE(result, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
+ cmoid = GetCompressionOidFromCompressionId(TOAST_COMPRESS_METHOD(attr));
- if (pglz_decompress(TOAST_COMPRESS_RAWDATA(attr),
- TOAST_COMPRESS_SIZE(attr),
- VARDATA(result),
- TOAST_COMPRESS_RAWSIZE(attr), true) < 0)
- elog(ERROR, "compressed data is corrupted");
+ /* get compression method handler routines */
+ cmroutine = GetCompressionRoutine(cmoid);
- return result;
+ return cmroutine->cmdecompress(attr);
I'm worried about how expensive this might be, and I think we could
make it cheaper. The reason why I think this might be expensive is:
currently, for every datum, you have a single direct function call.
Now, with this, you first have a direct function call to
GetCompressionOidFromCompressionId(). Then you have a call to
GetCompressionRoutine(), which does a syscache lookup and calls a
handler function, which is quite a lot more expensive than a single
function call. And the handler isn't even returning a statically
allocated structure, but is allocating new memory every time, which
involves more function calls and maybe memory leaks. Then you use the
results of all that to make an indirect function call.
I'm not sure exactly what combination of things we could use to make
this better, but it seems like there are a few possibilities:
(1) The handler function could return a pointer to the same
CompressionRoutine every time instead of constructing a new one every
time.
(2) The CompressionRoutine to which the handler function returns a
pointer could be statically allocated instead of being built at
runtime.
(3) GetCompressionRoutine could have an OID -> handler cache instead
of relying on syscache + calling the handler function all over again.
(4) For the compression types that have dedicated bit patterns in the
high bits of the compressed TOAST size, toast_compress_datum() could
just have hard-coded logic to use the correct handlers instead of
translating the bit pattern into an OID and then looking it up over
again.
(5) Going even further than #4 we could skip the handler layer
entirely for such methods, and just call the right function directly.
I think we should definitely do (1), and also (2) unless there's some
reason it's hard. (3) doesn't need to be part of this patch, but might
be something to consider later in the series. It's possible that it
doesn't have enough benefit to be worth the work, though. Also, I
think we should do either (4) or (5). I have a mild preference for (5)
unless it looks too ugly.
Note that I'm not talking about hard-coding a fast path for a
hard-coded list of OIDs - which would seem a little bit unprincipled -
but hard-coding a fast path for the bit patterns that are themselves
hard-coded. I don't think we lose anything in terms of extensibility
or even-handedness there; it's just avoiding a bunch of rigamarole
that doesn't really buy us anything.
All these points apply equally to toast_decompress_datum_slice() and
toast_compress_datum().
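As a sketch of what (1) and (2) amount to together, the handler could
simply return the address of one statically allocated struct. This is
illustrative only; the field names follow the CompressionRoutine
members quoted earlier:
static const CompressionRoutine lz4_compress_routine = {
	.type = T_CompressionRoutine,
	.cmcompress = lz4_cmcompress,
	.cmdecompress = lz4_cmdecompress
};
Datum
lz4handler(PG_FUNCTION_ARGS)
{
	/* same pointer on every call: nothing palloc'd, nothing leaked */
	PG_RETURN_POINTER(&lz4_compress_routine);
}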
+ /* Fallback to default compression method, if not specified */
+ if (!OidIsValid(cmoid))
+ cmoid = DefaultCompressionOid;
I think that the caller should be required to specify a legal value,
and this should be an elog(ERROR) or an Assert().
The change to equalTupleDescs() makes me wonder. Like, can we specify
the compression method for a function parameter, or a function return
value? I would think not. But then how are the tuple descriptors set
up in that case? Under what circumstances do we actually need the
tuple descriptors to compare unequal?
lz4.c's header comment calls it cm_lz4.c, and the pathname is wrong too.
I wonder if we should try to adopt a convention for the names of these
files that isn't just the compression method name, like cmlz4 or
compress_lz4. I kind of like the latter one. I am a little worried
that just calling it lz4.c will result in name collisions later - not
in this directory, of course, but elsewhere in the system. It's not a
disaster if that happens, but for example verbose error reports print
the file name, so it's nice if it's unambiguous.
+ if (!IsBinaryUpgrade &&
+ (relkind == RELKIND_RELATION ||
+ relkind == RELKIND_PARTITIONED_TABLE))
+ attr->attcompression =
+ GetAttributeCompressionMethod(attr, colDef->compression);
+ else
+ attr->attcompression = InvalidOid;
Storing InvalidOid in the IsBinaryUpgrade case looks wrong. If
upgrading from pre-v14, we need to store PGLZ_COMPRESSION_OID.
Otherwise, we need to preserve whatever value was present in the old
version. Or am I confused here?
I think there should be tests for the way this interacts with
partitioning, and I think the intended interaction should be
documented. Perhaps it should behave like TABLESPACE, where the parent
property has no effect on what gets stored because the parent has no
storage, but is inherited by each new child.
I wonder in passing about TOAST tables and materialized views, which
are the other things that have storage. What gets stored for
attcompression? For a TOAST table it probably doesn't matter much
since TOAST table entries shouldn't ever be toasted themselves, so
anything that doesn't crash is fine (but maybe we should test that
trying to alter the compression properties of a TOAST table doesn't
crash, for example). For a materialized view it seems reasonable to
want to set column properties, but I'm not quite sure how that works
today for things like STORAGE anyway. If we do allow setting STORAGE
or COMPRESSION for materialized view columns then dump-and-reload
needs to preserve the values.
+ /*
+ * Use default compression method if the existing compression method is
+ * invalid but the new storage type is non plain storage.
+ */
+ if (!OidIsValid(attrtuple->attcompression) &&
+ (newstorage != TYPSTORAGE_PLAIN))
+ attrtuple->attcompression = DefaultCompressionOid;
You have a few too many parens in there.
I don't see a particularly good reason to treat plain and external
differently. More generally, I think there's a question here about
when we need an attribute to have a valid compression type and when we
don't. If typstorage is plain or external, then there's no point in
ever having a compression type and maybe we should even reject
attempts to set one (but I'm not sure). However, the attstorage is a
different case. Suppose the column is created with extended storage
and then later it's changed to plain. That's only a hint, so there may
still be toasted values in that column, so the compression setting
must endure. At any rate, we need to make sure we have clear and
sensible rules for when attcompression (a) must be valid, (b) may be
valid, and (c) must be invalid. And those rules need to at least be
documented in the comments, and maybe in the SGML docs.
I'm out of time for today, so I'll have to look at this more another
day. Hope this helps for a start.
--
Robert Haas
EDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-23 12:00:45 |
Message-ID: | CAFiTN-sm=PFP4eRo11warbTz_uA1qZoOVUofcmG2WKvJ=M=gvg@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> ...
Thanks for the review, Robert. I will work on these comments and
provide my analysis along with the updated patch in a couple of days.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-24 12:11:23 |
Message-ID: | CAFiTN-v8a7PMTWWwLG4ASNw9GZwhFh40TEwLz8oLjUnhbUGqTw@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
Most of the comments look fine to me, but I have a slightly
different opinion on one of them, so I am replying only to that.
> I'm worried about how expensive this might be, and I think we could
> make it cheaper. The reason why I think this might be expensive is:
> currently, for every datum, you have a single direct function call.
> Now, with this, you first have a direct function call to
> GetCompressionOidFromCompressionId(). Then you have a call to
> GetCompressionRoutine(), which does a syscache lookup and calls a
> handler function, which is quite a lot more expensive than a single
> function call. And the handler isn't even returning a statically
> allocated structure, but is allocating new memory every time, which
> involves more function calls and maybe memory leaks. Then you use the
> results of all that to make an indirect function call.
>
> I'm not sure exactly what combination of things we could use to make
> this better, but it seems like there are a few possibilities:
>
> (1) The handler function could return a pointer to the same
> CompressionRoutine every time instead of constructing a new one every
> time.
> (2) The CompressionRoutine to which the handler function returns a
> pointer could be statically allocated instead of being built at
> runtime.
> (3) GetCompressionRoutine could have an OID -> handler cache instead
> of relying on syscache + calling the handler function all over again.
> (4) For the compression types that have dedicated bit patterns in the
> high bits of the compressed TOAST size, toast_compress_datum() could
> just have hard-coded logic to use the correct handlers instead of
> translating the bit pattern into an OID and then looking it up over
> again.
> (5) Going even further than #4 we could skip the handler layer
> entirely for such methods, and just call the right function directly.
>
> I think we should definitely do (1), and also (2) unless there's some
> reason it's hard. (3) doesn't need to be part of this patch, but might
> be something to consider later in the series. It's possible that it
> doesn't have enough benefit to be worth the work, though. Also, I
> think we should do either (4) or (5). I have a mild preference for (5)
> unless it looks too ugly.
>
> Note that I'm not talking about hard-coding a fast path for a
> hard-coded list of OIDs - which would seem a little bit unprincipled -
> but hard-coding a fast path for the bit patterns that are themselves
> hard-coded. I don't think we lose anything in terms of extensibility
> or even-handedness there; it's just avoiding a bunch of rigamarole
> that doesn't really buy us anything.
>
> All these points apply equally to toast_decompress_datum_slice() and
> toast_compress_datum().
I agree that we should definitely do (1) and (2) as part of the first
patch, and that (3) can be done in later patches. Between (4) and (5)
I am more inclined to do (4), for a couple of reasons:
a) If we bypass the handler function and call the compression and
decompression routines directly, then we need to check whether the
current executable was built with that particular compression library.
For example, in 'lz4handler' we have the check below; without the
handler function we would need to put this check either in each
compression/decompression function or in each caller.
Datum
lz4handler(PG_FUNCTION_ARGS)
{
#ifndef HAVE_LIBLZ4
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("not built with lz4 support")));
#else
b) Another reason is that once we start supporting compression options
(0006-Support-compression-methods-options.patch), we will also need to
call the 'cminitstate_function' to parse the compression options before
calling the compression function, so we would have to hard-code
multiple function calls.
I think b) is still manageable, but because of a) I am more inclined
to do (4). What is your opinion on this?
About (4), one option is to call the correct handler function for the
built-in types directly from the toast_(de)compress(_slice) functions,
but then we would be duplicating code. Another option is to keep
GetCompressionRoutine() as the common entry point and, inside it,
directly call the corresponding handler function for the built-in
types to get the routine. The only wrinkle is that, to avoid
duplicating code in the decompression path, we need to convert the
CompressionId to an Oid before calling GetCompressionRoutine(); but
with this we can still avoid the syscache lookup for the built-in
types.
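Roughly, what I have in mind for (4) is the sketch below (standalone C
with stubbed-out types; all names here are illustrative, not the exact
ones in the patch):

struct varlena;					/* opaque here; the real one is in postgres.h */
typedef unsigned int Oid;

typedef struct CompressionRoutine
{
	struct varlena *(*cmcompress) (const struct varlena *value);
	struct varlena *(*cmdecompress) (const struct varlena *value);
} CompressionRoutine;

/* built-in ids, matching the hard-coded 2-bit patterns in the header */
typedef enum CompressionId
{
	PGLZ_COMPRESSION_ID = 0,
	LZ4_COMPRESSION_ID = 1
} CompressionId;

extern const CompressionRoutine pglz_compress_methods;
extern const CompressionRoutine lz4_compress_methods;
extern const CompressionRoutine *LookupCompressionRoutineByOid(Oid cmoid);

/*
 * Fast path for the hard-coded bit patterns: built-in methods map
 * directly to statically allocated routines, so only custom methods
 * pay for the syscache lookup plus handler call.
 */
static const CompressionRoutine *
GetCompressionRoutineById(CompressionId cmid, Oid cmoid)
{
	switch (cmid)
	{
		case PGLZ_COMPRESSION_ID:
			return &pglz_compress_methods;
		case LZ4_COMPRESSION_ID:
			return &lz4_compress_methods;
		default:
			/* custom method: resolve through the catalog */
			return LookupCompressionRoutineByOid(cmoid);
	}
}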
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-24 13:44:08 |
Message-ID: | CA+Tgmob3W8cnLgOQX+JQzeyGN3eKGmRrBkUY6WGfNyHa+t_qEw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 24, 2020 at 7:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> About (4), one option is that we directly call the correct handler
> function for the built-in type directly from
> toast_(de)compress(_slice) functions but in that case, we are
> duplicating the code, another option is that we call the
> GetCompressionRoutine() a common function and in that, for the
> built-in type, we can directly call the corresponding handler function
> and get the routine. The only thing is to avoid duplicating in
> decompression routine we need to convert CompressionId to Oid before
> calling GetCompressionRoutine(), but now we can avoid sys cache lookup
> for the built-in type.
Suppose that we have a variable lz4_methods (like heapam_methods) that
is always defined, whether or not lz4 support is present. It's defined
like this:
const CompressionAmRoutine lz4_compress_methods = {
.datum_compress = lz4_datum_compress,
.datum_decompress = lz4_datum_decompress,
.datum_decompress_slice = lz4_datum_decompress_slice
};
(It would be good, I think, to actually name things something like
this - in particular why would we have TableAmRoutine and
IndexAmRoutine but not include "Am" in the one for compression? In
general I think tableam is a good pattern to adhere to and we should
try to make this patch hew closely to it.)
Then those functions are contingent on #ifdef HAVE_LIBLZ4: they either
do their thing, or complain that lz4 compression is not supported.
Then in this function you can just say, well, if we have the 01 bit
pattern, handler = &lz4_compress_methods and proceed from there.
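To spell that out, each of those functions would look roughly like
this (a sketch; the real compression body is elided):

static struct varlena *
lz4_datum_compress(const struct varlena *value)
{
#ifndef HAVE_LIBLZ4
	/* always compiled, but fails cleanly in builds without lz4 */
	ereport(ERROR,
			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			 errmsg("not built with lz4 support")));
	return NULL;				/* keep the compiler quiet */
#else
	/* ... compress 'value' with liblz4 and return the result ... */
#endif
}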
BTW, I think the "not supported" message should probably use the 'by
this build' language we use in some places i.e.
[rhaas pgsql]$ git grep errmsg.*'this build' | grep -vF .po:
contrib/pg_prewarm/pg_prewarm.c: errmsg("prefetch is not supported by
this build")));
src/backend/libpq/be-secure-openssl.c: (errmsg("\"%s\" setting \"%s\"
not supported by this build",
src/backend/libpq/be-secure-openssl.c: (errmsg("\"%s\" setting \"%s\"
not supported by this build",
src/backend/libpq/hba.c: errmsg("local connections are not supported
by this build"),
src/backend/libpq/hba.c: errmsg("hostssl record cannot match because
SSL is not supported by this build"),
src/backend/libpq/hba.c: errmsg("hostgssenc record cannot match
because GSSAPI is not supported by this build"),
src/backend/libpq/hba.c: errmsg("invalid authentication method \"%s\":
not supported by this build",
src/backend/utils/adt/pg_locale.c: errmsg("ICU is not supported in
this build"), \
src/backend/utils/misc/guc.c: GUC_check_errmsg("Bonjour is not
supported by this build");
src/backend/utils/misc/guc.c: GUC_check_errmsg("SSL is not supported
by this build");
--
Robert Haas
EDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-24 15:47:02 |
Message-ID: | CAFiTN-v5=4LxHfGd7+gXYH2fh0NnhzgcnJxGnHjo=k=7h5Z3ZQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 24, 2020 at 7:14 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Tue, Nov 24, 2020 at 7:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > About (4), one option is that we directly call the correct handler
> > function for the built-in type directly from
> > toast_(de)compress(_slice) functions but in that case, we are
> > duplicating the code, another option is that we call the
> > GetCompressionRoutine() a common function and in that, for the
> > built-in type, we can directly call the corresponding handler function
> > and get the routine. The only thing is to avoid duplicating in
> > decompression routine we need to convert CompressionId to Oid before
> > calling GetCompressionRoutine(), but now we can avoid sys cache lookup
> > for the built-in type.
>
> Suppose that we have a variable lz4_methods (like heapam_methods) that
> is always defined, whether or not lz4 support is present. It's defined
> like this:
>
> const CompressionAmRoutine lz4_compress_methods = {
> .datum_compress = lz4_datum_compress,
> .datum_decompress = lz4_datum_decompress,
> .datum_decompress_slice = lz4_datum_decompress_slice
> };
Yeah, this makes sense.
>
> (It would be good, I think, to actually name things something like
> this - in particular why would we have TableAmRoutine and
> IndexAmRoutine but not include "Am" in the one for compression? In
> general I think tableam is a good pattern to adhere to and we should
> try to make this patch hew closely to it.)
For the compression routine name, I did not include "Am" because
currently we are storing the compression methods in the new catalog
"pg_compression", not in pg_am. So are you suggesting that we should
store the compression methods in pg_am instead of creating a new
catalog? IMHO, storing them in a new catalog is the better option,
because the compression methods are not the same kind of thing as heap
or index AMs; they are not really access methods. Am I missing
something?
> Then those functions are contingent on #ifdef HAVE_LIBLZ4: they either
> do their thing, or complain that lz4 compression is not supported.
> Then in this function you can just say, well, if we have the 01 bit
> pattern, handler = &lz4_compress_methods and proceed from there.
Okay
> BTW, I think the "not supported" message should probably use the 'by
> this build' language we use in some places i.e.
>
Okay
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-24 17:15:42 |
Message-ID: | CA+TgmobTvecShXWyQ8Hdijc7mvU9hOGufNLAeisaJaxM61C7qg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 24, 2020 at 10:47 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> For the compression routine name, I did not include "Am" because
> currently, we are storing the compression method in the new catalog
> "pg_compression" not in the pg_am. So are you suggesting that we
> should store the compression methods also in the pg_am instead of
> creating a new catalog? IMHO, storing the compression methods in a
> new catalog is a better option instead of storing them in pg_am
> because actually, the compression methods are not the same as heap or
> index AMs, I mean they are actually not the access methods. Am I
> missing something?
Oh, I thought it had been suggested in previous discussions that these
should be treated as access methods rather than inventing a whole new
concept just for this, and it seemed like a good idea to me. I guess I
missed the fact that the patch wasn't doing it that way. Hmm.
--
Robert Haas
EDB: http://www.enterprisedb.com
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-24 18:21:38 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Nov 24, 2020 at 10:47 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> For the compression routine name, I did not include "Am" because
>> currently, we are storing the compression method in the new catalog
>> "pg_compression" not in the pg_am. So are you suggesting that we
>> should store the compression methods also in the pg_am instead of
>> creating a new catalog? IMHO, storing the compression methods in a
>> new catalog is a better option instead of storing them in pg_am
>> because actually, the compression methods are not the same as heap or
>> index AMs, I mean they are actually not the access methods. Am I
>> missing something?
> Oh, I thought it had been suggested in previous discussions that these
> should be treated as access methods rather than inventing a whole new
> concept just for this, and it seemed like a good idea to me. I guess I
> missed the fact that the patch wasn't doing it that way. Hmm.
FWIW, I kind of agree with Robert's take on this. Heap and index AMs
are pretty fundamentally different animals, yet we don't have a problem
sticking them in the same catalog. I think anything that is related to
storage access could reasonably go into that catalog, rather than
inventing a new one.
regards, tom lane
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-24 18:38:09 |
Message-ID: | CA+TgmoYuuLKVPnN0jv6vqe=DepzMBA3_i1qxe9NsqT1ypiDNuw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 24, 2020 at 1:21 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> FWIW, I kind of agree with Robert's take on this. Heap and index AMs
> are pretty fundamentally different animals, yet we don't have a problem
> sticking them in the same catalog. I think anything that is related to
> storage access could reasonably go into that catalog, rather than
> inventing a new one.
It's good to have your opinion on this since I wasn't totally sure
what was best, but for the record, I can't take credit. Looks like it
was Álvaro's suggestion originally:
http://postgr.es/m/[email protected]
--
Robert Haas
EDB: http://www.enterprisedb.com
From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-11-24 19:20:31 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-Nov-24, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > Oh, I thought it had been suggested in previous discussions that these
> > should be treated as access methods rather than inventing a whole new
> > concept just for this, and it seemed like a good idea to me. I guess I
> > missed the fact that the patch wasn't doing it that way. Hmm.
>
> FWIW, I kind of agree with Robert's take on this. Heap and index AMs
> are pretty fundamentally different animals, yet we don't have a problem
> sticking them in the same catalog. I think anything that is related to
> storage access could reasonably go into that catalog, rather than
> inventing a new one.
Right -- Something like amname=lz4, amhandler=lz4handler, amtype=c.
The core code must of course know how to instantiate an AM of type
'c' and what to use it for.
https://postgr.es/m/[email protected]
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-01 09:45:04 |
Message-ID: | CAFiTN-u9Y399WyBKufHXAo1TA9TyX_9cjG+j6BnzK9UrFU12Sw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Nov 25, 2020 at 12:50 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2020-Nov-24, Tom Lane wrote:
>
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>
> > > Oh, I thought it had been suggested in previous discussions that these
> > > should be treated as access methods rather than inventing a whole new
> > > concept just for this, and it seemed like a good idea to me. I guess I
> > > missed the fact that the patch wasn't doing it that way. Hmm.
> >
> > FWIW, I kind of agree with Robert's take on this. Heap and index AMs
> > are pretty fundamentally different animals, yet we don't have a problem
> > sticking them in the same catalog. I think anything that is related to
> > storage access could reasonably go into that catalog, rather than
> > inventing a new one.
>
> Right -- Something like amname=lz4, amhandler=lz4handler, amtype=c.
> The core code must of course know how to instantiate an AM of type
> 'c' and what to use it for.
>
> https://postgr.es/m/[email protected]
I have changed this. I agree that using access methods for the
compression methods has simplified the code. I will share the updated
patch set after fixing the other review comments from Robert.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-01 11:20:56 |
Message-ID: | CAFiTN-u9+ePF_FTiMBpHNzdxmOQYj9n2cjFx+XbyyJr-vXxgOw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
While working on this comment, I have some doubts.
> I wonder in passing about TOAST tables and materialized views, which
> are the other things that have storage. What gets stored for
> attcompression? For a TOAST table it probably doesn't matter much
> since TOAST table entries shouldn't ever be toasted themselves, so
> anything that doesn't crash is fine (but maybe we should test that
> trying to alter the compression properties of a TOAST table doesn't
> crash, for example).
Yeah, for the toast table it doesn't matter, but I am not sure what
you mean by altering the compression method for the toast table. Do
you mean manually updating the pg_attribute tuple for the toast table
and setting a different compression method? Or is there some direct
way to alter the toast table?
> For a materialized view it seems reasonable to
> want to set column properties, but I'm not quite sure how that works
> today for things like STORAGE anyway. If we do allow setting STORAGE
> or COMPRESSION for materialized view columns then dump-and-reload
> needs to preserve the values.
I see that we allow setting STORAGE for a materialized view, but I am
not sure what the use case is. Basically, the tuples are selected
directly from the host table and inserted into the materialized view
without checking the target and source storage types. The behavior is
the same if you execute INSERT INTO dest_table SELECT * FROM
source_table: if the source_table attribute has extended storage and
the target table has plain storage, the value will still be inserted
directly into the target table without any conversion. For a plain
table this is still fine, because a newly inserted tuple will be
stored as per the new storage method, but I don't know of any use case
for a materialized view. So now I am wondering what the behavior for
materialized views should be.
Can we give materialized views the same behavior as storage? For the
built-in compression methods that might not be a problem, but for an
external compression method, how do we handle the dependency? Say the
table used an external compression method "cm1" when the materialized
view was created, and later we alter the table, set a new compression
method, and force a table rewrite. The tuples inside the materialized
view are still compressed with "cm1", yet no attribute maintains a
dependency on "cm1", so dropping cm1 would be allowed. So I think for
the compression method we should treat a materialized view the same as
a table: allow setting the compression method on the materialized
view, and always ensure that every tuple in the view is compressed
with the current or a preserved compression method. That means
whenever we insert into the materialized view, we should compare the
datum's compression method with the target's compression method.
> + /*
> + * Use default compression method if the existing compression method is
> + * invalid but the new storage type is non plain storage.
> + */
> + if (!OidIsValid(attrtuple->attcompression) &&
> + (newstorage != TYPSTORAGE_PLAIN))
> + attrtuple->attcompression = DefaultCompressionOid;
>
> You have a few too many parens in there.
>
> I don't see a particularly good reason to treat plain and external
> differently.
Yeah, I think they should be treated the same.
> More generally, I think there's a question here about
> when we need an attribute to have a valid compression type and when we
> don't. If typstorage is plan or external, then there's no point in
> ever having a compression type and maybe we should even reject
> attempts to set one (but I'm not sure).
I agree.
> However, the attstorage is a
> different case. Suppose the column is created with extended storage
> and then later it's changed to plain. That's only a hint, so there may
> still be toasted values in that column, so the compression setting
> must endure. At any rate, we need to make sure we have clear and
> sensible rules for when attcompression (a) must be valid, (b) may be
> valid, and (c) must be invalid. And those rules need to at least be
> documented in the comments, and maybe in the SGML docs.
IIUC, even if we change the attstorage, the existing tuples are stored
as-is, without changing their storage. So even if the attstorage is
changed, attcompression should not change.
After observing this behavior of storage, I tend to think that the
built-in compression methods should behave the same way: if a tuple is
compressed with one of the built-in compression methods, and we alter
the compression method or do an INSERT INTO ... SELECT into a target
field with a different compression method, then we should not
rewrite/decompress those tuples. In other words, the built-in
compression methods can always be treated as PRESERVE, because they
can not be dropped.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-01 15:45:44 |
Message-ID: | CAFiTN-sYceWDFqK_Pws11J1FB9p4VGwBHC8ZeEtEhPbCWFij=g@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Dec 1, 2020 at 4:50 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> While working on this comment I have doubts.
>
> > I wonder in passing about TOAST tables and materialized views, which
> > are the other things that have storage. What gets stored for
> > attcompression? For a TOAST table it probably doesn't matter much
> > since TOAST table entries shouldn't ever be toasted themselves, so
> > anything that doesn't crash is fine (but maybe we should test that
> > trying to alter the compression properties of a TOAST table doesn't
> > crash, for example).
>
> Yeah for the toast table it doesn't matter, but I am not sure what do
> you mean by altering the compression method for the toast table. Do you
> mean manually update the pg_attribute tuple for the toast table and
> set different compression methods? Or there is some direct way to
> alter the toast table?
>
> > For a materialized view it seems reasonable to
> > want to set column properties, but I'm not quite sure how that works
> > today for things like STORAGE anyway. If we do allow setting STORAGE
> > or COMPRESSION for materialized view columns then dump-and-reload
> > needs to preserve the values.
>
> I see that we allow setting the STORAGE for the materialized view but
> I am not sure what is the use case. Basically, the tuples are
> directly getting selected from the host table and inserted in the
> materialized view without checking target and source storage type.
> The behavior is the same if you execute INSERT INTO dest_table SELECT
> * FROM source_table. Basically, if the source_table attribute has
> extended storage and the target table has plain storage, still the
> value will be inserted directly into the target table without any
> conversion. However, in the table, you can insert the new tuple and
> that will be stored as per the new storage method so that is still
> fine but I don't know any use case for the materialized view. Now I am
> thinking what should be the behavior for the materialized view?
>
> For the materialized view can we have the same behavior as storage? I
> think for the built-in compression method that might not be a problem
> but for the external compression method how can we handle the
> dependency, I mean when the materialized view has created the table
> was having an external compression method "cm1" and we have created
> the materialized view based on that now if we alter table and set the
> new compression method and force table rewrite then what will happen
> to the tuple inside the materialized view, I mean tuple is still
> compressed with "cm1" and there is no attribute is maintaining the
> dependency on "cm1" because the materialized view can point to any
> compression method. Now if we drop the cm1 it will be allowed to
> drop. So I think for the compression method we can consider the
> materialized view same as the table, I mean we can allow setting the
> compression method for the materialized view and we can always ensure
> that all the tuple in this view is compressed with the current or the
> preserved compression methods. So whenever we are inserting in the
> materialized view then we should compare the datum compression method
> with the target compression method.
>
>
> > + /*
> > + * Use default compression method if the existing compression method is
> > + * invalid but the new storage type is non plain storage.
> > + */
> > + if (!OidIsValid(attrtuple->attcompression) &&
> > + (newstorage != TYPSTORAGE_PLAIN))
> > + attrtuple->attcompression = DefaultCompressionOid;
> >
> > You have a few too many parens in there.
> >
> > I don't see a particularly good reason to treat plain and external
> > differently.
>
> Yeah, I think they should be treated the same.
>
> > More generally, I think there's a question here about
> > when we need an attribute to have a valid compression type and when we
> > don't. If typstorage is plan or external, then there's no point in
> > ever having a compression type and maybe we should even reject
> > attempts to set one (but I'm not sure).
>
> I agree.
>
> > However, the attstorage is a
> > different case. Suppose the column is created with extended storage
> > and then later it's changed to plain. That's only a hint, so there may
> > still be toasted values in that column, so the compression setting
> > must endure. At any rate, we need to make sure we have clear and
> > sensible rules for when attcompression (a) must be valid, (b) may be
> > valid, and (c) must be invalid. And those rules need to at least be
> > documented in the comments, and maybe in the SGML docs.
>
> IIUC, even if we change the attstorage the existing tuples are stored
> as it is without changing the tuple storage. So I think even if the
> attstorage is changed the attcompression should not have any change.
>
I have put some more thought into this, and IMHO the rules should be
as below:
1. If attstorage is EXTENDED -> attcompression "must be valid"
2. If attstorage is PLAIN/EXTERNAL -> attcompression "may be valid"
3. If typstorage is PLAIN/EXTERNAL -> attcompression "must be invalid"
I am a little bit confused about (2). Basically, it will be valid in
the scenario you mentioned, where attstorage is changed from EXTENDED
to PLAIN/EXTERNAL. But I think in that case we could also just set
attcompression to invalid; however, we would then have to maintain a
dependency between the attribute and the compression method, so that
the old methods with which we might have compressed a few tuples in
the table don't get dropped.
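If we settle on these rules, the comment atop the attcompression field
could read roughly like the sketch below (wording illustrative, not
final):

/*
 * Sketch of the proposed rules for attcompression:
 *
 * - attstorage EXTENDED: must be a valid compression method Oid;
 * - attstorage PLAIN/EXTERNAL: may still be valid, since older tuples
 *   may contain datums compressed under a previous setting;
 * - typstorage PLAIN/EXTERNAL: must be InvalidOid, since values of
 *   such a type can never be compressed at all.
 */
Oid			attcompression;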
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-04 12:16:34 |
Message-ID: | CAFiTN-tzTTT2oqWdRGLv1dvvS5MC1W+LE+3bqWPJUZj4GnHOJg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Dec 1, 2020 at 9:15 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Dec 1, 2020 at 4:50 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > While working on this comment I have doubts.
> >
> > > I wonder in passing about TOAST tables and materialized views, which
> > > are the other things that have storage. What gets stored for
> > > attcompression? For a TOAST table it probably doesn't matter much
> > > since TOAST table entries shouldn't ever be toasted themselves, so
> > > anything that doesn't crash is fine (but maybe we should test that
> > > trying to alter the compression properties of a TOAST table doesn't
> > > crash, for example).
> >
> > Yeah for the toast table it doesn't matter, but I am not sure what do
> > you mean by altering the compression method for the toast table. Do you
> > mean manually update the pg_attribute tuple for the toast table and
> > set different compression methods? Or there is some direct way to
> > alter the toast table?
> >
> > > For a materialized view it seems reasonable to
> > > want to set column properties, but I'm not quite sure how that works
> > > today for things like STORAGE anyway. If we do allow setting STORAGE
> > > or COMPRESSION for materialized view columns then dump-and-reload
> > > needs to preserve the values.
> >
> > I see that we allow setting the STORAGE for the materialized view but
> > I am not sure what is the use case. Basically, the tuples are
> > directly getting selected from the host table and inserted in the
> > materialized view without checking target and source storage type.
> > The behavior is the same if you execute INSERT INTO dest_table SELECT
> > * FROM source_table. Basically, if the source_table attribute has
> > extended storage and the target table has plain storage, still the
> > value will be inserted directly into the target table without any
> > conversion. However, in the table, you can insert the new tuple and
> > that will be stored as per the new storage method so that is still
> > fine but I don't know any use case for the materialized view. Now I am
> > thinking what should be the behavior for the materialized view?
> >
> > For the materialized view can we have the same behavior as storage? I
> > think for the built-in compression method that might not be a problem
> > but for the external compression method how can we handle the
> > dependency, I mean when the materialized view has created the table
> > was having an external compression method "cm1" and we have created
> > the materialized view based on that now if we alter table and set the
> > new compression method and force table rewrite then what will happen
> > to the tuple inside the materialized view, I mean tuple is still
> > compressed with "cm1" and there is no attribute is maintaining the
> > dependency on "cm1" because the materialized view can point to any
> > compression method. Now if we drop the cm1 it will be allowed to
> > drop. So I think for the compression method we can consider the
> > materialized view same as the table, I mean we can allow setting the
> > compression method for the materialized view and we can always ensure
> > that all the tuple in this view is compressed with the current or the
> > preserved compression methods. So whenever we are inserting in the
> > materialized view then we should compare the datum compression method
> > with the target compression method.
As per an offlist discussion with Robert, for tables and materialized
views we will always compress the value according to the target
attribute's compression method. So if we are creating/refreshing a
materialized view and the attcompression of the target attribute
differs from that of the source table, we will decompress the value
and compress it back as per the target table/view.
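So the insert/refresh path would do roughly the following (pseudo-C
sketch; all helper names here are made up for illustration):

	/*
	 * Sketch only -- DatumIsCompressed(), DatumGetCompressionOid(),
	 * DatumDecompress() and DatumCompress() are invented names.
	 */
	if (DatumIsCompressed(value) &&
		DatumGetCompressionOid(value) != att->attcompression)
	{
		value = DatumDecompress(value);	/* back to a plain varlena */
		value = DatumCompress(value, att->attcompression);
	}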
> >
> >
> > > + /*
> > > + * Use default compression method if the existing compression method is
> > > + * invalid but the new storage type is non plain storage.
> > > + */
> > > + if (!OidIsValid(attrtuple->attcompression) &&
> > > + (newstorage != TYPSTORAGE_PLAIN))
> > > + attrtuple->attcompression = DefaultCompressionOid;
> > >
> > > You have a few too many parens in there.
> > >
> > > I don't see a particularly good reason to treat plain and external
> > > differently.
> >
> > Yeah, I think they should be treated the same.
> >
> > > More generally, I think there's a question here about
> > > when we need an attribute to have a valid compression type and when we
> > > don't. If typstorage is plan or external, then there's no point in
> > > ever having a compression type and maybe we should even reject
> > > attempts to set one (but I'm not sure).
> >
> > I agree.
> >
> > > However, the attstorage is a
> > > different case. Suppose the column is created with extended storage
> > > and then later it's changed to plain. That's only a hint, so there may
> > > still be toasted values in that column, so the compression setting
> > > must endure. At any rate, we need to make sure we have clear and
> > > sensible rules for when attcompression (a) must be valid, (b) may be
> > > valid, and (c) must be invalid. And those rules need to at least be
> > > documented in the comments, and maybe in the SGML docs.
> >
> > IIUC, even if we change the attstorage the existing tuples are stored
> > as it is without changing the tuple storage. So I think even if the
> > attstorage is changed the attcompression should not have any change.
> >
>
> I have put some more thought into this and IMHO the rules should be as below
>
> 1. If attstorage is EXTENDED -> attcompression "must be valid"
> 2. if attstorage is PLAIN/EXTERNAL -> atttcompression "maybe valid"
> 3. if typstorage is PLAIN/EXTERNAL -> atttcompression "must be invalid"
>
> I am a little bit confused about (2), basically, it will be valid in
> the scenario u mentioned that change the atttstorege from EXTENDED to
> PLAIN/EXTERNAL. But I think in this case also we can just set the
> attcompression to invalid, however, we have to maintain the dependency
> between attribute and compression method so that the old methods using
> which we might have compressed a few tuples in the table doesn't get
> dropped.
For this also I had an offlist discussion with Robert, and we decided
that it makes sense to always store a valid compression method in the
attribute if the attribute type is compressible, irrespective of the
current attribute storage. For example, if the attribute type is
varchar, then it will always have a valid compression method,
regardless of whether the attstorage is plain or external.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-09 12:07:27 |
Message-ID: | CAFiTN-tot88hBPRz3ahJyeseEssB-4czURd7r8jP-LxmnG8=dQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Nov 11, 2020 at 9:39 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > There were a few problems in this rebased version, basically, the
> > compression options were not passed while compressing values from the
> > brin_form_tuple, so I have fixed this.
>
> Since the authorship history of this patch is complicated, it would be
> nice if you would include authorship information and relevant
> "Discussion" links in the patches.
I have added that.
> Design level considerations and overall notes:
>
> configure is autogenerated from configure.in, so the patch shouldn't
> include changes only to the former.
Yeah, I missed those changes. Done now.
> Looking over the changes to src/include:
>
> + PGLZ_COMPRESSION_ID,
> + LZ4_COMPRESSION_ID
>
> I think that it would be good to assign values to these explicitly.
Done
> +/* compresion handler routines */
>
> Spelling.
Done
> + /* compression routine for the compression method */
> + cmcompress_function cmcompress;
> +
> + /* decompression routine for the compression method */
> + cmcompress_function cmdecompress;
>
> Don't reuse cmcompress_function; that's confusing. Just have a typedef
> per structure member, even if they end up being the same.
Fixed as suggested
> #define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
> - (((toast_compress_header *) (ptr))->rawsize = (len))
> +do { \
> + Assert(len > 0 && len <= RAWSIZEMASK); \
> + ((toast_compress_header *) (ptr))->info = (len); \
> +} while (0)
>
> Indentation.
Done
> +#define TOAST_COMPRESS_SET_COMPRESSION_METHOD(ptr, cm_method) \
> + ((toast_compress_header *) (ptr))->info |= ((cm_method) << 30);
>
> What about making TOAST_COMPRESS_SET_RAWSIZE() take another argument?
> And possibly also rename it to TEST_COMPRESS_SET_SIZE_AND_METHOD() or
> something? It seems not great to have separate functions each setting
> part of a 4-byte quantity. Too much chance of failing to set both
> parts. I guess you've got a function called
> toast_set_compressed_datum_info() for that, but it's just a wrapper
> around two macros that could just be combined, which would reduce
> complexity overall.
Done that way
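For reference, the combined setter looks roughly like this (a sketch;
the exact macro in the patch may differ):

#define TOAST_COMPRESS_SET_SIZE_AND_COMPRESS_METHOD(ptr, len, cm_method) \
do { \
	Assert((len) > 0 && (len) <= RAWSIZEMASK); \
	Assert((cm_method) >= 0 && (cm_method) <= 3); \
	((toast_compress_header *) (ptr))->info = \
		((uint32) (len)) | ((uint32) (cm_method) << 30); \
} while (0)

Since the raw size and the 2-bit method id now go into the 'info' word
in a single assignment, a caller can no longer set one part and forget
the other.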
> + T_CompressionRoutine, /* in access/compressionapi.h */
>
> This looks misplaced. I guess it should go just after these:
>
> T_FdwRoutine, /* in foreign/fdwapi.h */
> T_IndexAmRoutine, /* in access/amapi.h */
> T_TableAmRoutine, /* in access/tableam.h */
Done
> Looking over the regression test changes:
>
> The tests at the top of create_cm.out that just test that we can
> create tables with various storage types seem unrelated to the purpose
> of the patch. And the file doesn't test creating a compression method
> either, as the file name would suggest, so either the file name needs
> to be changed (compression, compression_method?) or the tests don't go
> here.
Changed to "compression"
> +-- check data is okdd
>
> I guess whoever is responsible for this comment prefers vi to emacs.
Fixed
> I don't quite understand the purpose of all of these tests, and there
> are some things that I feel like ought to be tested that seemingly
> aren't. Like, you seem to test using an UPDATE to move a datum from a
> table to another table with the same compression method, but not one
> with a different compression method.
Added test for this, and some other tests to improve overall coverage.
> Testing the former is nice and
> everything, but that's the easy case: I think we also need to test the
> latter. I think it would be good to verify not only that the data is
> readable but that it's compressed the way we expect. I think it would
> be a great idea to add a pg_column_compression() function in a similar
> spirit to pg_column_size(). Perhaps it could return NULL when
> compression is not in use or the data type is not varlena, and the
> name of the compression method otherwise. That would allow for better
> testing of this feature, and it would also be useful to users who are
> switching methods, to see what data they still have that's using the
> old method. It could be useful for debugging problems on customer
> systems, too.
This is a really great idea. I have added this function and used it in
my tests.
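The function is shaped roughly like the sketch below
(toast_get_compression_id() stands in for whatever helper the final
patch uses, and a complete version would also need to handle
externally stored values):

Datum
pg_column_compression(PG_FUNCTION_ARGS)
{
	struct varlena *attr = (struct varlena *) DatumGetPointer(PG_GETARG_DATUM(0));
	const char *name;

	if (!VARATT_IS_COMPRESSED(attr))
		PG_RETURN_NULL();		/* value is not compressed */

	switch (toast_get_compression_id(attr))
	{
		case PGLZ_COMPRESSION_ID:
			name = "pglz";
			break;
		case LZ4_COMPRESSION_ID:
			name = "lz4";
			break;
		default:
			PG_RETURN_NULL();	/* custom method would need a catalog lookup */
	}

	PG_RETURN_TEXT_P(cstring_to_text(name));
}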
> I wonder if we need a test that moves data between tables through an
> intermediary. For instance, suppose a plpgsql function or DO block
> fetches some data and stores it in a plpgsql variable and then uses
> the variable to insert into another table. Hmm, maybe that would force
> de-TOASTing. But perhaps there are other cases. Maybe a more general
> way to approach the problem is: have you tried running a coverage
> report and checked which parts of your code are getting exercised by
> the existing tests and which parts are not? The stuff that isn't, we
> should try to add more tests. It's easy to get corner cases wrong with
> this kind of thing.
>
> I notice that LIKE INCLUDING COMPRESSION doesn't seem to be tested, at
> least not by 0001, which reinforces my feeling that the tests here are
> not as thorough as they could be.
Added test for this as well.
> +NOTICE: pg_compression contains unpinned initdb-created object(s)
> This seems wrong to me - why is it OK?
Yeah, this was wrong; it is fixed now.
> - result = (struct varlena *)
> - palloc(TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
> - SET_VARSIZE(result, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
> + cmoid = GetCompressionOidFromCompressionId(TOAST_COMPRESS_METHOD(attr));
>
> - if (pglz_decompress(TOAST_COMPRESS_RAWDATA(attr),
> - TOAST_COMPRESS_SIZE(attr),
> - VARDATA(result),
> -
> TOAST_COMPRESS_RAWSIZE(attr), true) < 0)
> - elog(ERROR, "compressed data is corrupted");
> + /* get compression method handler routines */
> + cmroutine = GetCompressionRoutine(cmoid);
>
> - return result;
> + return cmroutine->cmdecompress(attr);
>
> I'm worried about how expensive this might be, and I think we could
> make it cheaper. The reason why I think this might be expensive is:
> currently, for every datum, you have a single direct function call.
> Now, with this, you first have a direct function call to
> GetCompressionOidFromCompressionId(). Then you have a call to
> GetCompressionRoutine(), which does a syscache lookup and calls a
> handler function, which is quite a lot more expensive than a single
> function call. And the handler isn't even returning a statically
> allocated structure, but is allocating new memory every time, which
> involves more function calls and maybe memory leaks. Then you use the
> results of all that to make an indirect function call.
>
> I'm not sure exactly what combination of things we could use to make
> this better, but it seems like there are a few possibilities:
>
> (1) The handler function could return a pointer to the same
> CompressionRoutine every time instead of constructing a new one every
> time.
> (2) The CompressionRoutine to which the handler function returns a
> pointer could be statically allocated instead of being built at
> runtime.
> (3) GetCompressionRoutine could have an OID -> handler cache instead
> of relying on syscache + calling the handler function all over again.
> (4) For the compression types that have dedicated bit patterns in the
> high bits of the compressed TOAST size, toast_compress_datum() could
> just have hard-coded logic to use the correct handlers instead of
> translating the bit pattern into an OID and then looking it up over
> again.
> (5) Going even further than #4 we could skip the handler layer
> entirely for such methods, and just call the right function directly.
> I think we should definitely do (1), and also (2) unless there's some
> reason it's hard. (3) doesn't need to be part of this patch, but might
> be something to consider later in the series. It's possible that it
> doesn't have enough benefit to be worth the work, though. Also, I
> think we should do either (4) or (5). I have a mild preference for (5)
> unless it looks too ugly.
> Note that I'm not talking about hard-coding a fast path for a
> hard-coded list of OIDs - which would seem a little bit unprincipled -
> but hard-coding a fast path for the bit patterns that are themselves
> hard-coded. I don't think we lose anything in terms of extensibility
> or even-handedness there; it's just avoiding a bunch of rigamarole
> that doesn't really buy us anything.
>
> All these points apply equally to toast_decompress_datum_slice() and
> toast_compress_datum().
Fixed as discussed at [1]
> + /* Fallback to default compression method, if not specified */
> + if (!OidIsValid(cmoid))
> + cmoid = DefaultCompressionOid;
>
> I think that the caller should be required to specify a legal value,
> and this should be an elog(ERROR) or an Assert().
>
> The change to equalTupleDescs() makes me wonder. Like, can we specify
> the compression method for a function parameter, or a function return
> value? I would think not. But then how are the tuple descriptors set
> up in that case? Under what circumstances do we actually need the
> tuple descriptors to compare unequal?
If we alter the compression method, we check whether we need to
rebuild the tuple descriptor based on which value changed; so if the
attribute's compression method is changed, we need to rebuild the
tuple descriptor, right? You might say that since the first patch does
not allow altering the compression method we could move this to the
second patch, but since this patch adds the field to pg_attribute, I
thought it better to add this check here as well. What am I missing?
> lz4.c's header comment calls it cm_lz4.c, and the pathname is wrong too.
>
> I wonder if we should try to adopt a convention for the names of these
> files that isn't just the compression method name, like cmlz4 or
> compress_lz4. I kind of like the latter one. I am a little worried
> that just calling it lz4.c will result in name collisions later - not
> in this directory, of course, but elsewhere in the system. It's not a
> disaster if that happens, but for example verbose error reports print
> the file name, so it's nice if it's unambiguous.
Changed to compress_lz4.
> + if (!IsBinaryUpgrade &&
> + (relkind == RELKIND_RELATION ||
> + relkind == RELKIND_PARTITIONED_TABLE))
> + attr->attcompression =
> +
> GetAttributeCompressionMethod(attr, colDef->compression);
> + else
> + attr->attcompression = InvalidOid;
>
> Storing InvalidOid in the IsBinaryUpgrade case looks wrong. If
> upgrading from pre-v14, we need to store PGLZ_COMPRESSION_OID.
> Otherwise, we need to preserve whatever value was present in the old
> version. Or am I confused here?
Okay, so I think we can simply remove the IsBinaryUpgrade check, and
then it will behave as expected. Basically, if the compression method
is specified then that method will be used, and if it is not specified
then PGLZ_COMPRESSION_OID will be used.
> I think there should be tests for the way this interacts with
> partitioning, and I think the intended interaction should be
> documented. Perhaps it should behave like TABLESPACE, where the parent
> property has no effect on what gets stored because the parent has no
> storage, but is inherited by each new child.
I have added a test for this and documented the behavior as well.
> I wonder in passing about TOAST tables and materialized views, which
> are the other things that have storage. What gets stored for
> attcompression?
I have changed this to always store the invalid compression method.
> For a TOAST table it probably doesn't matter much
> since TOAST table entries shouldn't ever be toasted themselves, so
> anything that doesn't crash is fine (but maybe we should test that
> trying to alter the compression properties of a TOAST table doesn't
> crash, for example).
You mean updating the pg_attribute entry for the toasted field (e.g.
chunk_data) and setting attcompression to something valid? Or is there
a better way to write this test?
> For a materialized view it seems reasonable to
> want to set column properties, but I'm not quite sure how that works
> today for things like STORAGE anyway. If we do allow setting STORAGE
> or COMPRESSION for materialized view columns then dump-and-reload
> needs to preserve the values.
Fixed as described at [2].
> + /*
> + * Use default compression method if the existing compression method is
> + * invalid but the new storage type is non plain storage.
> + */
> + if (!OidIsValid(attrtuple->attcompression) &&
> + (newstorage != TYPSTORAGE_PLAIN))
> + attrtuple->attcompression = DefaultCompressionOid;
>
> You have a few too many parens in there.
Fixed
> I don't see a particularly good reason to treat plain and external
> differently. More generally, I think there's a question here about
> when we need an attribute to have a valid compression type and when we
> don't. If typstorage is plan or external, then there's no point in
> ever having a compression type and maybe we should even reject
> attempts to set one (but I'm not sure). However, the attstorage is a
> different case. Suppose the column is created with extended storage
> and then later it's changed to plain. That's only a hint, so there may
> still be toasted values in that column, so the compression setting
> must endure. At any rate, we need to make sure we have clear and
> sensible rules for when attcompression (a) must be valid, (b) may be
> valid, and (c) must be invalid. And those rules need to at least be
> documented in the comments, and maybe in the SGML docs.
>
> I'm out of time for today, so I'll have to look at this more another
> day. Hope this helps for a start.
Fixed as described at [2], and the rules are documented in
pg_attribute.h (atop the attcompression field).
[1] https://www.postgresql.org/message-id/CA%2BTgmob3W8cnLgOQX%2BJQzeyGN3eKGmRrBkUY6WGfNyHa%2Bt_qEw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFiTN-tzTTT2oqWdRGLv1dvvS5MC1W%2BLE%2B3bqWPJUZj4GnHOJg%40mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v15-0002-alter-table-set-compression.patch | application/octet-stream | 17.8 KB |
v15-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.9 KB |
v15-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 44.0 KB |
v15-0004-Create-custom-compression-methods.patch | application/octet-stream | 29.3 KB |
v15-0001-Built-in-compression-method.patch | application/octet-stream | 223.7 KB |
v15-0006-Support-compression-methods-options.patch | application/octet-stream | 54.2 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-17 05:25:58 |
Message-ID: | CAFiTN-vh==jT9KdNAeneY=LRNdH6ZH255GBHZW6PiN1QpmQeww@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Dec 9, 2020 at 5:37 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > On Wed, Nov 11, 2020 at 9:39 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > There were a few problems in this rebased version, basically, the
> > > compression options were not passed while compressing values from the
> > > brin_form_tuple, so I have fixed this.
> >
> > Since the authorship history of this patch is complicated, it would be
> > nice if you would include authorship information and relevant
> > "Discussion" links in the patches.
>
> I have added that.
>
> > Design level considerations and overall notes:
> >
> > configure is autogenerated from configure.in, so the patch shouldn't
> > include changes only to the former.
>
> Yeah, I missed those changes. Done now.
>
> > Looking over the changes to src/include:
> >
> > + PGLZ_COMPRESSION_ID,
> > + LZ4_COMPRESSION_ID
> >
> > I think that it would be good to assign values to these explicitly.
>
> Done
>
> > +/* compresion handler routines */
> >
> > Spelling.
>
> Done
>
> > + /* compression routine for the compression method */
> > + cmcompress_function cmcompress;
> > +
> > + /* decompression routine for the compression method */
> > + cmcompress_function cmdecompress;
> >
> > Don't reuse cmcompress_function; that's confusing. Just have a typedef
> > per structure member, even if they end up being the same.
>
> Fixed as suggested
>
> > #define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
> > - (((toast_compress_header *) (ptr))->rawsize = (len))
> > +do { \
> > + Assert(len > 0 && len <= RAWSIZEMASK); \
> > + ((toast_compress_header *) (ptr))->info = (len); \
> > +} while (0)
> >
> > Indentation.
>
> Done
>
> > +#define TOAST_COMPRESS_SET_COMPRESSION_METHOD(ptr, cm_method) \
> > + ((toast_compress_header *) (ptr))->info |= ((cm_method) << 30);
> >
> > What about making TOAST_COMPRESS_SET_RAWSIZE() take another argument?
> > And possibly also rename it to TEST_COMPRESS_SET_SIZE_AND_METHOD() or
> > something? It seems not great to have separate functions each setting
> > part of a 4-byte quantity. Too much chance of failing to set both
> > parts. I guess you've got a function called
> > toast_set_compressed_datum_info() for that, but it's just a wrapper
> > around two macros that could just be combined, which would reduce
> > complexity overall.
>
> Done that way
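For illustration, a minimal sketch of what the combined setter might look like, assuming the info-word layout from the snippets quoted above (30 bits of raw size, 2 high bits of method id); the names here are illustrative rather than the committed ones:

#define RAWSIZEMASK	0x3FFFFFFFU

/*
 * Set the raw size and the compression method id of the 4-byte info
 * word in a single step, so a caller cannot update one half and
 * forget the other.
 */
#define TOAST_COMPRESS_SET_SIZE_AND_METHOD(ptr, len, cm_method) \
do { \
	Assert((len) > 0 && (len) <= RAWSIZEMASK); \
	((toast_compress_header *) (ptr))->info = \
		(uint32) (len) | ((uint32) (cm_method) << 30); \
} while (0)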
>
> > + T_CompressionRoutine, /* in access/compressionapi.h */
> >
> > This looks misplaced. I guess it should go just after these:
> >
> > T_FdwRoutine, /* in foreign/fdwapi.h */
> > T_IndexAmRoutine, /* in access/amapi.h */
> > T_TableAmRoutine, /* in access/tableam.h */
>
> Done
>
> > Looking over the regression test changes:
> >
> > The tests at the top of create_cm.out that just test that we can
> > create tables with various storage types seem unrelated to the purpose
> > of the patch. And the file doesn't test creating a compression method
> > either, as the file name would suggest, so either the file name needs
> > to be changed (compression, compression_method?) or the tests don't go
> > here.
>
> Changed to "compression"
>
> > +-- check data is okdd
> >
> > I guess whoever is responsible for this comment prefers vi to emacs.
>
> Fixed
>
> > I don't quite understand the purpose of all of these tests, and there
> > are some things that I feel like ought to be tested that seemingly
> > aren't. Like, you seem to test using an UPDATE to move a datum from a
> > table to another table with the same compression method, but not one
> > with a different compression method.
>
> Added test for this, and some other tests to improve overall coverage.
>
> > Testing the former is nice and
> > everything, but that's the easy case: I think we also need to test the
> > latter. I think it would be good to verify not only that the data is
> > readable but that it's compressed the way we expect. I think it would
> > be a great idea to add a pg_column_compression() function in a similar
> > spirit to pg_column_size(). Perhaps it could return NULL when
> > compression is not in use or the data type is not varlena, and the
> > name of the compression method otherwise. That would allow for better
> > testing of this feature, and it would also be useful to users who are
> > switching methods, to see what data they still have that's using the
> > old method. It could be useful for debugging problems on customer
> > systems, too.
>
> This is a really great idea, I have added this function and used it in my test.
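As a rough usage sketch (output shape assumed, since it depends on the patch version):

CREATE TABLE cmtest (f1 text COMPRESSION lz4);
INSERT INTO cmtest VALUES (repeat('1234567890', 1000));
SELECT pg_column_compression(f1) FROM cmtest;
 pg_column_compression
-----------------------
 lz4
(1 row)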
>
> > I wonder if we need a test that moves data between tables through an
> > intermediary. For instance, suppose a plpgsql function or DO block
> > fetches some data and stores it in a plpgsql variable and then uses
> > the variable to insert into another table. Hmm, maybe that would force
> > de-TOASTing. But perhaps there are other cases. Maybe a more general
> > way to approach the problem is: have you tried running a coverage
> > report and checked which parts of your code are getting exercised by
> > the existing tests and which parts are not? The stuff that isn't, we
> > should try to add more tests. It's easy to get corner cases wrong with
> > this kind of thing.
> >
> > I notice that LIKE INCLUDING COMPRESSION doesn't seem to be tested, at
> > least not by 0001, which reinforces my feeling that the tests here are
> > not as thorough as they could be.
>
> Added test for this as well.
>
> > +NOTICE: pg_compression contains unpinned initdb-created object(s)
>
> > This seems wrong to me - why is it OK?
>
> Yeah, this is wrong, now fixed.
>
> > - result = (struct varlena *)
> > - palloc(TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
> > - SET_VARSIZE(result, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
> > + cmoid = GetCompressionOidFromCompressionId(TOAST_COMPRESS_METHOD(attr));
> >
> > - if (pglz_decompress(TOAST_COMPRESS_RAWDATA(attr),
> > - TOAST_COMPRESS_SIZE(attr),
> > - VARDATA(result),
> > -
> > TOAST_COMPRESS_RAWSIZE(attr), true) < 0)
> > - elog(ERROR, "compressed data is corrupted");
> > + /* get compression method handler routines */
> > + cmroutine = GetCompressionRoutine(cmoid);
> >
> > - return result;
> > + return cmroutine->cmdecompress(attr);
> >
> > I'm worried about how expensive this might be, and I think we could
> > make it cheaper. The reason why I think this might be expensive is:
> > currently, for every datum, you have a single direct function call.
> > Now, with this, you first have a direct function call to
> > GetCompressionOidFromCompressionId(). Then you have a call to
> > GetCompressionRoutine(), which does a syscache lookup and calls a
> > handler function, which is quite a lot more expensive than a single
> > function call. And the handler isn't even returning a statically
> > allocated structure, but is allocating new memory every time, which
> > involves more function calls and maybe memory leaks. Then you use the
> > results of all that to make an indirect function call.
> >
> > I'm not sure exactly what combination of things we could use to make
> > this better, but it seems like there are a few possibilities:
> >
> > (1) The handler function could return a pointer to the same
> > CompressionRoutine every time instead of constructing a new one every
> > time.
> > (2) The CompressionRoutine to which the handler function returns a
> > pointer could be statically allocated instead of being built at
> > runtime.
> > (3) GetCompressionRoutine could have an OID -> handler cache instead
> > of relying on syscache + calling the handler function all over again.
> > (4) For the compression types that have dedicated bit patterns in the
> > high bits of the compressed TOAST size, toast_compress_datum() could
> > just have hard-coded logic to use the correct handlers instead of
> > translating the bit pattern into an OID and then looking it up over
> > again.
> > (5) Going even further than #4 we could skip the handler layer
> > entirely for such methods, and just call the right function directly.
> > I think we should definitely do (1), and also (2) unless there's some
> > reason it's hard. (3) doesn't need to be part of this patch, but might
> > be something to consider later in the series. It's possible that it
> > doesn't have enough benefit to be worth the work, though. Also, I
> > think we should do either (4) or (5). I have a mild preference for (5)
> > unless it looks too ugly.
> > Note that I'm not talking about hard-coding a fast path for a
> > hard-coded list of OIDs - which would seem a little bit unprincipled -
> > but hard-coding a fast path for the bit patterns that are themselves
> > hard-coded. I don't think we lose anything in terms of extensibility
> > or even-handedness there; it's just avoiding a bunch of rigamarole
> > that doesn't really buy us anything.
> >
> > All these points apply equally to toast_decompress_datum_slice() and
> > toast_compress_datum().
>
> Fixed as discussed at [1]
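As a sketch of option (5), dispatching directly on the hard-coded bit pattern; the TOAST_COMPRESS_METHOD() macro is from the quoted code, while the per-method helper names are assumed:

static struct varlena *
toast_decompress_datum(struct varlena *attr)
{
	/* Direct dispatch: no OID translation, no syscache lookup,
	 * no handler construction for the built-in methods. */
	switch (TOAST_COMPRESS_METHOD(attr))
	{
		case PGLZ_COMPRESSION_ID:
			return pglz_decompress_datum(attr);
		case LZ4_COMPRESSION_ID:
			return lz4_decompress_datum(attr);
		default:
			elog(ERROR, "invalid compression method id %d",
				 TOAST_COMPRESS_METHOD(attr));
			return NULL;	/* keep compiler quiet */
	}
}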
>
> > + /* Fallback to default compression method, if not specified */
> > + if (!OidIsValid(cmoid))
> > + cmoid = DefaultCompressionOid;
> >
> > I think that the caller should be required to specify a legal value,
> > and this should be an elog(ERROR) or an Assert().
> >
> > The change to equalTupleDescs() makes me wonder. Like, can we specify
> > the compression method for a function parameter, or a function return
> > value? I would think not. But then how are the tuple descriptors set
> > up in that case? Under what circumstances do we actually need the
> > tuple descriptors to compare unequal?
>
> If we alter the compression method then we check whether we need to
> rebuild the tuple descriptor or not based on which value is changed, so
> if the attribute's compression method is changed we need to rebuild the
> tuple descriptor, right? You might say that the first patch doesn't
> allow altering the compression method, so we could move this to the
> second patch, but since this patch adds the field to pg_attribute I
> thought it better to add this check here as well. What am I missing?
>
> > lz4.c's header comment calls it cm_lz4.c, and the pathname is wrong too.
> >
> > I wonder if we should try to adopt a convention for the names of these
> > files that isn't just the compression method name, like cmlz4 or
> > compress_lz4. I kind of like the latter one. I am a little worried
> > that just calling it lz4.c will result in name collisions later - not
> > in this directory, of course, but elsewhere in the system. It's not a
> > disaster if that happens, but for example verbose error reports print
> > the file name, so it's nice if it's unambiguous.
>
> Changed to compress_lz4.
>
> > + if (!IsBinaryUpgrade &&
> > + (relkind == RELKIND_RELATION ||
> > + relkind == RELKIND_PARTITIONED_TABLE))
> > + attr->attcompression =
> > +
> > GetAttributeCompressionMethod(attr, colDef->compression);
> > + else
> > + attr->attcompression = InvalidOid;
> >
> > Storing InvalidOid in the IsBinaryUpgrade case looks wrong. If
> > upgrading from pre-v14, we need to store PGLZ_COMPRESSION_OID.
> > Otherwise, we need to preserve whatever value was present in the old
> > version. Or am I confused here?
>
> Okay, so I think we can simply remove the IsBinaryUpgrade check so it
> will behave as expected. Basically, now if the compression method is
> specified then it will take that compression method, and if it is not
> specified then it will take PGLZ_COMPRESSION_OID.
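So the resulting assignment would be roughly (a sketch of the quoted hunk with the IsBinaryUpgrade test dropped):

	if (relkind == RELKIND_RELATION ||
		relkind == RELKIND_PARTITIONED_TABLE)
		attr->attcompression =
			GetAttributeCompressionMethod(attr, colDef->compression);
	else
		attr->attcompression = InvalidOid;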
>
> > I think there should be tests for the way this interacts with
> > partitioning, and I think the intended interaction should be
> > documented. Perhaps it should behave like TABLESPACE, where the parent
> > property has no effect on what gets stored because the parent has no
> > storage, but is inherited by each new child.
>
> I have added the test for this and also documented the same.
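A sketch of the documented behavior (session assumed; each partition inherits the parent's per-column setting but can override it):

CREATE TABLE parted (f1 text COMPRESSION lz4) PARTITION BY LIST (f1);
CREATE TABLE part1 PARTITION OF parted FOR VALUES IN ('a');	-- inherits lz4
CREATE TABLE part2 PARTITION OF parted FOR VALUES IN ('b');
ALTER TABLE part2 ALTER COLUMN f1 SET COMPRESSION pglz;	-- per-partition override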
>
> > I wonder in passing about TOAST tables and materialized views, which
> > are the other things that have storage. What gets stored for
> > attcompression?
>
> I have changed this to store the Invalid compression method always.
>
> > For a TOAST table it probably doesn't matter much
> > since TOAST table entries shouldn't ever be toasted themselves, so
> > anything that doesn't crash is fine (but maybe we should test that
> > trying to alter the compression properties of a TOAST table doesn't
> > crash, for example).
>
> You mean to update the pg_attribute table for the toasted field (e.g
> chunk_data) and set the attcompression to something valid? Or there
> is a better way to write this test?
>
> > For a materialized view it seems reasonable to
> > want to set column properties, but I'm not quite sure how that works
> > today for things like STORAGE anyway. If we do allow setting STORAGE
> > or COMPRESSION for materialized view columns then dump-and-reload
> > needs to preserve the values.
>
> Fixed as described as [2]
>
> > + /*
> > + * Use default compression method if the existing compression method is
> > + * invalid but the new storage type is non plain storage.
> > + */
> > + if (!OidIsValid(attrtuple->attcompression) &&
> > + (newstorage != TYPSTORAGE_PLAIN))
> > + attrtuple->attcompression = DefaultCompressionOid;
> >
> > You have a few too many parens in there.
>
> Fixed
>
> > I don't see a particularly good reason to treat plain and external
> > differently. More generally, I think there's a question here about
> > when we need an attribute to have a valid compression type and when we
> > don't. If typstorage is plan or external, then there's no point in
> > ever having a compression type and maybe we should even reject
> > attempts to set one (but I'm not sure). However, the attstorage is a
> > different case. Suppose the column is created with extended storage
> > and then later it's changed to plain. That's only a hint, so there may
> > still be toasted values in that column, so the compression setting
> > must endure. At any rate, we need to make sure we have clear and
> > sensible rules for when attcompression (a) must be valid, (b) may be
> > valid, and (c) must be invalid. And those rules need to at least be
> > documented in the comments, and maybe in the SGML docs.
> >
> > I'm out of time for today, so I'll have to look at this more another
> > day. Hope this helps for a start.
>
> Fixed as I have described at [2], and the rules are documented in
> pg_attribute.h (atop attcompression field)
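For reference, the shape of those rules, paraphrased from the discussion above rather than quoted from pg_attribute.h:

/*
 * attcompression (sketch of the rules):
 *	- must be valid when the column's type is a compressible varlena
 *	  and attstorage permits compression (extended or main);
 *	- may remain valid after attstorage is later set to plain, since
 *	  already-compressed values may still exist in the column;
 *	- must be InvalidOid when typstorage is plain or external.
 */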
>
> [1] https://www.postgresql.org/message-id/CA%2BTgmob3W8cnLgOQX%2BJQzeyGN3eKGmRrBkUY6WGfNyHa%2Bt_qEw%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAFiTN-tzTTT2oqWdRGLv1dvvS5MC1W%2BLE%2B3bqWPJUZj4GnHOJg%40mail.gmail.com
>
While analyzing how attribute merging should work for the compression
method of an inherited child, I looked at the existing behavior for the
storage method, and I found some behavior that doesn't seem right.
Basically, while creating the inherited child we don't allow the storage
to be different from the parent attribute's storage, but later we are
allowed to alter it. Is that the correct behavior?
Here is a test case to demonstrate this.
postgres[12546]=# create table t (a varchar);
postgres[12546]=# alter table t ALTER COLUMN a SET STORAGE plain;
postgres[12546]=# create table t1 (a varchar);
postgres[12546]=# alter table t1 ALTER COLUMN a SET STORAGE external;
/* Not allowed to set external because the parent attribute has plain */
postgres[12546]=# create table t2 (LIKE t1 INCLUDING STORAGE) INHERITS ( t);
NOTICE: 00000: merging column "a" with inherited definition
LOCATION: MergeAttributes, tablecmds.c:2685
ERROR: 42804: column "a" has a storage parameter conflict
DETAIL: PLAIN versus EXTERNAL
LOCATION: MergeAttributes, tablecmds.c:2730
postgres[12546]=# create table t2 (LIKE t1 ) INHERITS (t);
/* But you can alter now */
postgres[12546]=# alter TABLE t2 ALTER COLUMN a SET STORAGE EXTERNAL ;
postgres[12546]=# \d+ t
Table "public.t"
Column | Type | Collation | Nullable | Default | Storage
| Compression | Stats target | Description
--------+-------------------+-----------+----------+---------+---------+-------------+--------------+-------------
a | character varying | | |
| plain | pglz | |
Child tables: t2
Access method: heap
postgres[12546]=# \d+ t2
Table "public.t2"
Column | Type | Collation | Nullable | Default | Storage
| Compression | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+-------------+--------------+-------------
a | character varying | | |
| external | pglz | |
Inherits: t
Access method: heap
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-25 09:34:05 |
Message-ID: | CAFiTN-sh-rwf+8vQfqD6F_ExBR2r=XK7V4AuKJm=znf3WM9MAA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Dec 17, 2020 at 10:55 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> While analyzing how attribute merging should work for the compression
> method of an inherited child, I looked at the existing behavior for the
> storage method, and I found some behavior that doesn't seem right.
> Basically, while creating the inherited child we don't allow the storage
> to be different from the parent attribute's storage, but later we are
> allowed to alter it. Is that the correct behavior?
>
> Here is a test case to demonstrate this.
>
> postgres[12546]=# create table t (a varchar);
> postgres[12546]=# alter table t ALTER COLUMN a SET STORAGE plain;
> postgres[12546]=# create table t1 (a varchar);
> postgres[12546]=# alter table t1 ALTER COLUMN a SET STORAGE external;
>
> /* Not allowed to set external because the parent attribute has plain */
> postgres[12546]=# create table t2 (LIKE t1 INCLUDING STORAGE) INHERITS ( t);
> NOTICE: 00000: merging column "a" with inherited definition
> LOCATION: MergeAttributes, tablecmds.c:2685
> ERROR: 42804: column "a" has a storage parameter conflict
> DETAIL: PLAIN versus EXTERNAL
> LOCATION: MergeAttributes, tablecmds.c:2730
On further analysis, IMHO the reason for this error is not that
different storage methods are disallowed for an inherited child's
attributes; rather, the error is reported because of conflicting storage
between child and parent. For example, if a child inherits from two
parents that have the same attribute name but different storage types,
it will conflict in the same way. I know that when parent and child
conflict we might give preference to the child's storage, but I don't
see much of a problem with the current behavior either. So for now I
have kept the same behavior for compression as well, and I have added a
test case for it.
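For concreteness, the two-parent conflict mentioned above looks like this (session assumed, messages as reported by MergeAttributes):

create table p1 (a varchar);
alter table p1 ALTER COLUMN a SET STORAGE plain;
create table p2 (a varchar);
alter table p2 ALTER COLUMN a SET STORAGE external;
create table c () INHERITS (p1, p2);
NOTICE:  merging multiple inherited definitions of column "a"
ERROR:  inherited column "a" has a storage parameter conflict
DETAIL:  PLAIN versus EXTERNAL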
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v16-0002-alter-table-set-compression.patch | application/octet-stream | 17.9 KB |
v16-0004-Create-custom-compression-methods.patch | application/octet-stream | 29.3 KB |
v16-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 9.9 KB |
v16-0001-Built-in-compression-method.patch | application/octet-stream | 224.2 KB |
v16-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 44.0 KB |
v16-0006-Support-compression-methods-options.patch | application/octet-stream | 57.3 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-27 07:10:34 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> 25 дек. 2020 г., в 14:34, Dilip Kumar <dilipbalaut(at)gmail(dot)com> написал(а):
>
> <v16-0002-alter-table-set-compression.patch> <v16-0004-Create-custom-compression-methods.patch> <v16-0005-new-compression-method-extension-for-zlib.patch> <v16-0001-Built-in-compression-method.patch> <v16-0003-Add-support-for-PRESERVE.patch> <v16-0006-Support-compression-methods-options.patch>
Maybe add LZ4/Zlib WAL FPI compression on top of this patchset? I'm not insisting on anything; it just would be so cool to have it...
BTW currently there are Oid collisions in original patchset.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
v16-0007-Add-Lz4-compression-to-WAL-FPIs.patch | application/octet-stream | 11.0 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-28 05:20:05 |
Message-ID: | CAFiTN-usCTFDUXwNCzAAOYnEEuO71Hq12_GZWx+sU47Ud=CS-Q@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Dec 27, 2020 at 12:40 PM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
>
>
> > 25 дек. 2020 г., в 14:34, Dilip Kumar <dilipbalaut(at)gmail(dot)com> написал(а):
> >
> > <v16-0002-alter-table-set-compression.patch> <v16-0004-Create-custom-compression-methods.patch> <v16-0005-new-compression-method-extension-for-zlib.patch> <v16-0001-Built-in-compression-method.patch> <v16-0003-Add-support-for-PRESERVE.patch> <v16-0006-Support-compression-methods-options.patch>
>
> Maybe add LZ4/Zlib WAL FPI compression on top of this patchset? I'm not insisting on anything; it just would be so cool to have it...
>
> BTW currently there are Oid collisions in original patchset.
Thanks for the patch. Maybe we can allow setting custom compression
methods for WAL compression as well.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-28 06:14:43 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> 28 дек. 2020 г., в 10:20, Dilip Kumar <dilipbalaut(at)gmail(dot)com> написал(а):
>
> On Sun, Dec 27, 2020 at 12:40 PM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>>
>>
>>
>>> 25 дек. 2020 г., в 14:34, Dilip Kumar <dilipbalaut(at)gmail(dot)com> написал(а):
>>>
>>> <v16-0002-alter-table-set-compression.patch> <v16-0004-Create-custom-compression-methods.patch> <v16-0005-new-compression-method-extension-for-zlib.patch> <v16-0001-Built-in-compression-method.patch> <v16-0003-Add-support-for-PRESERVE.patch> <v16-0006-Support-compression-methods-options.patch>
>>
>> Maybe add LZ4/Zlib WAL FPI compression on top of this patchset? I'm not insisting on anything; it just would be so cool to have it...
>>
>> BTW currently there are Oid collisions in original patchset.
>
> Thanks for the patch. Maybe we can allow setting custom compression
> methods for WAL compression as well.
No, unfortunately, we can't use truly custom methods here. Custom compression handlers are defined via WAL-logged catalog changes, so they can't be looked up while WAL is being replayed. So we can use only a static set of hardcoded compression methods.
Thanks!
Best regards, Andrey Borodin.
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2020-12-29 17:48:49 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> 28 дек. 2020 г., в 11:14, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> написал(а):
>
>> Thanks for the patch. Maybe we can allow setting custom compression
>> methods for WAL compression as well.
>
> No, unfortunately, we can't use truly custom methods here. Custom compression handlers are defined via WAL-logged catalog changes, so they can't be looked up while WAL is being replayed. So we can use only a static set of hardcoded compression methods.
So, I've made some very basic benchmarks on my machine [0].
With pglz, after a checkpoint I observe 1146 and 1225 tps.
With lz4 I observe 1485 and 1524 tps.
Without wal_compression I see 1529 tps.
These observations can be explained with a plain statement: pglz is a bottleneck on my machine; lz4 is not.
While a similar effect can be reached by other means [1], I believe having lz4 for WAL FPIs would be much more CPU efficient.
PFA the lz4-for-WAL-FPIs patch v17. Changes: fixed some frontend issues, added some comments.
Best regards, Andrey Borodin.
[0] https://yadi.sk/d/6y5YiROXQRkoEw
[1] https://www.postgresql.org/message-id/flat/25991595-1848-4178-AA57-872B10309DA2%40yandex-team.ru#e7bb0e048358bcff281011dcf115ad42
Attachment | Content-Type | Size |
---|---|---|
v17-0007-Add-Lz4-compression-to-WAL-FPIs.patch | application/octet-stream | 11.9 KB |
From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-04 01:22:20 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
The most recent patch doesn't compile --without-lz4:
compress_lz4.c:191:17: error: ‘lz4_cmcheck’ undeclared here (not in a function)
.datum_check = lz4_cmcheck,
...
And fails pg_upgrade check, apparently losing track of the compression (?)
CREATE TABLE public.cmdata2 (
- f1 text COMPRESSION lz4
+ f1 text
);
You added pg_dump --no-compression, but the --help isn't updated. I think
there should also be an option for pg_restore, like --no-tablespaces. And I
think there should be a GUC for default_compression, like
default_table_access_method, so one can restore into an alternate compression
by setting PGOPTIONS=-cdefault_compression=lz4.
I'd like to be able to make all compressible columns of a table use a
non-default compression (except those which cannot), without having to use
\gexec... We have tables with up to 1600 columns. So a GUC would allow that.
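A sketch of the workflow this would enable (note the GUC is proposed here and does not exist in the patchset; cmdata2 is the example table from the pg_upgrade diff above):

$ PGOPTIONS='-c default_compression=lz4' pg_restore -d target dump.custom
$ psql target -c 'SELECT DISTINCT pg_column_compression(f1) FROM cmdata2'
 pg_column_compression
-----------------------
 lz4
(1 row)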
Previously (on separate threads) I wondered whether pg_dump
--no-table-access-method was needed - maybe that be sufficient for this case,
too, but I think it should be possible to separately avoid restoring
compression AM and AM "proper". So maybe it'd be like --no-tableam=compress
--no-tableam=storage or --no-tableam-all.
Some language fixes:
Subject: [PATCH v16 1/6] Built-in compression method
+++ b/doc/src/sgml/ddl.sgml
@@ -3762,6 +3762,8 @@ CREATE TABLE measurement (
<productname>PostgreSQL</productname>
tables (or, possibly, foreign tables). It is possible to specify a
tablespace and storage parameters for each partition separately.
+ Partitions inherits the compression method of the parent for each column
+ however we can set different compression method for each partition.
Should say:
+ By default, each column in a partition inherits the compression method from its parent table,
+ however a different compression method can be set for each partition.
+++ b/doc/src/sgml/ref/create_table.sgml
+ <varlistentry>
+ <term><literal>INCLUDING COMPRESSION</literal></term>
+ <listitem>
+ <para>
+ Compression method of the columns will be coppied. The default
+ behavior is to exclude compression method, resulting in the copied
+ column will have the default compression method if the column type is
+ compressible.
Say:
+ Compression method of the columns will be copied. The default
+ behavior is to exclude compression methods, resulting in the
+ columns having the default compression method.
+ <varlistentry>
+ <term><literal>COMPRESSION <replaceable class="parameter">compression_method</replaceable></literal></term>
+ <listitem>
+ <para>
+ This clause adds the compression method to a column. Compression method
+ can be set from the available built-in compression methods. The available
+ options are <literal>pglz</literal> and <literal>lz4</literal>. If the
+ compression method is not sepcified for the compressible type then it will
+ have the default compression method. The default compression method is
+ <literal>pglz</literal>.
Say "The compression method can be set from available compression methods" (or
remove this sentence).
Say "The available BUILT-IN methods are ..."
sepcified => specified
+
+ /*
+ * No point in wasting a palloc cycle if value size is out of the allowed
+ * range for compression
say "outside the allowed range"
+ if (pset.sversion >= 120000 &&
+ if (pset.sversion >= 120000 &&
A couple places that need to say >= 14
Subject: [PATCH v16 2/6] alter table set compression
+ <literal>SET COMPRESSION <replaceable class="parameter">compression_method</replaceable></literal>
+ This clause adds compression to a column. Compression method can be set
+ from available built-in compression methods. The available built-in
+ methods are <literal>pglz</literal> and <literal>lz4</literal>.
Should say "The compression method can be set to any available method. The
built in methods are >PGLZ< or >LZ<"
That fixes the grammar, and corrects the text: it's possible to set an available
method other than what's "built-in".
+++ b/src/include/commands/event_trigger.h
@@ -32,7 +32,7 @@ typedef struct EventTriggerData
#define AT_REWRITE_ALTER_PERSISTENCE 0x01
#define AT_REWRITE_DEFAULT_VAL 0x02
#define AT_REWRITE_COLUMN_REWRITE 0x04
-
+#define AT_REWRITE_ALTER_COMPRESSION 0x08
/*
This is losing a useful newline.
Subject: [PATCH v16 4/6] Create custom compression methods
+ This clause adds compression to a column. Compression method
+ could be created with <xref linkend="sql-create-access-method"/> or it can
+ be set from the available built-in compression methods. The available
+ built-in methods are <literal>pglz</literal> and <literal>lz4</literal>.
+ The PRESERVE list contains list of compression methods used on the column
+ and determines which of them should be kept on the column. Without
+ PRESERVE or if all the previous compression methods are not preserved then
+ the table will be rewritten. If PRESERVE ALL is specified then all the
+ previous methods will be preserved and the table will not be rewritten.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index f404dd1088..ade3989d75 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -999,11 +999,12 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
+ could be created with <xref linkend="sql-create-access-method"/> or it can
+ be set from the available built-in compression methods. The available
remove this first "built-in" ?
+ built-in methods are <literal>pglz</literal> and <literal>lz4</literal>.
+GetCompressionAmRoutineByAmId(Oid amoid)
...
+ /* Check if it's an index access method as opposed to some other AM */
+ if (amform->amtype != AMTYPE_COMPRESSION)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("access method \"%s\" is not of type %s",
+ NameStr(amform->amname), "INDEX")));
...
+ errmsg("index access method \"%s\" does not have a handler",
In 3 places, the comment and code should say "COMPRESSION" right ?
Subject: [PATCH v16 6/6] Support compression methods options
+ If compression method has options they could be specified with
+ <literal>WITH</literal> parameter.
If *the* compression method has options, they *can* be specified with *the* ...
@@ -1004,7 +1004,9 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
+ method is <literal>pglz</literal>. If the compression method has options
+ they could be specified by <literal>WITH</literal>
+ parameter.
same
+static void *
+lz4_cminitstate(List *options)
+{
+ int32 *acceleration = palloc(sizeof(int32));
+
+ /* initialize with the default acceleration */
+ *acceleration = 1;
+
+ if (list_length(options) > 0)
+ {
+ ListCell *lc;
+
+ foreach(lc, options)
+ {
+ DefElem *def = (DefElem *) lfirst(lc);
+
+ if (strcmp(def->defname, "acceleration") == 0)
+ *acceleration = pg_atoi(defGetString(def), sizeof(int32), 0);
Don't you need to say "else: error: unknown compression option" ?
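A sketch of that fix, reusing the quoted loop (ereport wording illustrative):

	foreach(lc, options)
	{
		DefElem    *def = (DefElem *) lfirst(lc);

		if (strcmp(def->defname, "acceleration") == 0)
			*acceleration = pg_atoi(defGetString(def), sizeof(int32), 0);
		else
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("unrecognized compression option \"%s\"",
							def->defname)));
	}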
+ /*
+ * Compression option must be only valid if we are updating the compression
+ * method.
+ */
+ Assert(DatumGetPointer(acoptions) == NULL || OidIsValid(newcompression));
+
should say "need be valid only if .."
--
Justin
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-04 06:52:36 |
Message-ID: | CAFiTN-svAHmNshc1ozi5aoD-FKXQjDc9K6Wb3CcdSYQt0QDiNQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
Thanks for the review, I will work on these and respond along with the
updated patches.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-04 11:27:16 |
Message-ID: | CAFiTN-utJPbTLQ9i10wT_zmHX=un+RQMB1B1xbkTgrh971vqjw@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> The most recent patch doesn't compile --without-lz4:
>
> compress_lz4.c:191:17: error: ‘lz4_cmcheck’ undeclared here (not in a function)
> .datum_check = lz4_cmcheck,
> ...
>
My bad, fixed this.
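For context, a minimal sketch of how such a build failure is usually fixed,
assuming the patch guards its LZ4 code behind a configure-time USE_LZ4 macro
(the macro name and the stub body here are assumptions, not the actual patch):

#ifdef USE_LZ4
/* datum_check callback; compiled only when built --with-lz4 */
static bool
lz4_cmcheck(const struct varlena *value)
{
    return true;                /* real validation elided in this sketch */
}
#endif                          /* USE_LZ4 */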
> And fails pg_upgrade check, apparently losing track of the compression (?)
>
> CREATE TABLE public.cmdata2 (
> - f1 text COMPRESSION lz4
> + f1 text
> );
I did not get this? The pg_upgrade check is passing for me.
> You added pg_dump --no-compression, but the --help isn't updated.
Fixed.
> I think
> there should also be an option for pg_restore, like --no-tablespaces. And I
> think there should be a GUC for default_compression, like
> default_table_access_method, so one can restore into an alternate compression
> by setting PGOPTIONS=-cdefault_compression=lz4.
>
> I'd like to be able to make all compressible columns of a table use a
> non-default compression (except those which cannot), without having to use
> \gexec... We have tables with up to 1600 columns. So a GUC would allow that.
>
> Previously (on separate threads) I wondered whether pg_dump
> --no-table-access-method was needed - maybe that be sufficient for this case,
> too, but I think it should be possible to separately avoid restoring
> compression AM and AM "proper". So maybe it'd be like --no-tableam=compress
> --no-tableam=storage or --no-tableam-all.
I will put more thought into this and respond separately.
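For illustration, a hypothetical guc.c entry modeled on
default_table_access_method; the GUC name, variable, and check hook below are
assumptions, not part of any posted patch:

/* hypothetical string GUC holding the default compression method name */
static char *default_compression_method = "pglz";

{
    {"default_compression", PGC_USERSET, CLIENT_CONN_STATEMENT,
        gettext_noop("Sets the default compression method for new columns."),
        NULL,
        GUC_IS_NAME
    },
    &default_compression_method,
    "pglz",
    check_default_compression_method, NULL, NULL
},

With something along these lines, PGOPTIONS=-cdefault_compression=lz4 would
let a restore produce lz4-compressed columns without editing the dump.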
> Some language fixes:
>
> Subject: [PATCH v16 1/6] Built-in compression method
>
> +++ b/doc/src/sgml/ddl.sgml
> @@ -3762,6 +3762,8 @@ CREATE TABLE measurement (
> <productname>PostgreSQL</productname>
> tables (or, possibly, foreign tables). It is possible to specify a
> tablespace and storage parameters for each partition separately.
> + Partitions inherits the compression method of the parent for each column
> + however we can set different compression method for each partition.
>
> Should say:
> + By default, each column in a partition inherits the compression method from its parent table,
> + however a different compression method can be set for each partition.
Done
> +++ b/doc/src/sgml/ref/create_table.sgml
>
> + <varlistentry>
> + <term><literal>INCLUDING COMPRESSION</literal></term>
> + <listitem>
> + <para>
> + Compression method of the columns will be coppied. The default
> + behavior is to exclude compression method, resulting in the copied
> + column will have the default compression method if the column type is
> + compressible.
>
> Say:
> + Compression method of the columns will be copied. The default
> + behavior is to exclude compression methods, resulting in the
> + columns having the default compression method.
Done
> + <varlistentry>
> + <term><literal>COMPRESSION <replaceable class="parameter">compression_method</replaceable></literal></term>
> + <listitem>
> + <para>
> + This clause adds the compression method to a column. Compression method
> + can be set from the available built-in compression methods. The available
> + options are <literal>pglz</literal> and <literal>lz4</literal>. If the
> + compression method is not sepcified for the compressible type then it will
> + have the default compression method. The default compression method is
> + <literal>pglz</literal>.
>
> Say "The compression method can be set from available compression methods" (or
> remove this sentence).
> Say "The available BUILT-IN methods are ..."
> sepcified => specified
Done
> +
> + /*
> + * No point in wasting a palloc cycle if value size is out of the allowed
> + * range for compression
>
> say "outside the allowed range"
>
> + if (pset.sversion >= 120000 &&
> + if (pset.sversion >= 120000 &&
>
> A couple places that need to say >= 14
Fixed
> Subject: [PATCH v16 2/6] alter table set compression
>
> + <literal>SET COMPRESSION <replaceable class="parameter">compression_method</replaceable></literal>
> + This clause adds compression to a column. Compression method can be set
> + from available built-in compression methods. The available built-in
> + methods are <literal>pglz</literal> and <literal>lz4</literal>.
>
> Should say "The compression method can be set to any available method. The
> built in methods are >PGLZ< or >LZ<"
> That fixes grammar, and correction that it's possible to set to an available
> method other than what's "built-in".
Done
> +++ b/src/include/commands/event_trigger.h
> @@ -32,7 +32,7 @@ typedef struct EventTriggerData
> #define AT_REWRITE_ALTER_PERSISTENCE 0x01
> #define AT_REWRITE_DEFAULT_VAL 0x02
> #define AT_REWRITE_COLUMN_REWRITE 0x04
> -
> +#define AT_REWRITE_ALTER_COMPRESSION 0x08
> /*
>
> This is losing a useful newline.
Fixed
> Subject: [PATCH v16 4/6] Create custom compression methods
>
> + This clause adds compression to a column. Compression method
> + could be created with <xref linkend="sql-create-access-method"/> or it can
> + be set from the available built-in compression methods. The available
> + built-in methods are <literal>pglz</literal> and <literal>lz4</literal>.
> + The PRESERVE list contains list of compression methods used on the column
> + and determines which of them should be kept on the column. Without
> + PRESERVE or if all the previous compression methods are not preserved then
> + the table will be rewritten. If PRESERVE ALL is specified then all the
> + previous methods will be preserved and the table will not be rewritten.
> </para>
> </listitem>
> </varlistentry>
>
> diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
> index f404dd1088..ade3989d75 100644
> --- a/doc/src/sgml/ref/create_table.sgml
> +++ b/doc/src/sgml/ref/create_table.sgml
> @@ -999,11 +999,12 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
> + could be created with <xref linkend="sql-create-access-method"/> or it can
> + be set from the available built-in compression methods. The available
>
> remove this first "built-in" ?
Done
> + built-in methods are <literal>pglz</literal> and <literal>lz4</literal>.
>
>
> +GetCompressionAmRoutineByAmId(Oid amoid)
> ...
> + /* Check if it's an index access method as opposed to some other AM */
> + if (amform->amtype != AMTYPE_COMPRESSION)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("access method \"%s\" is not of type %s",
> + NameStr(amform->amname), "INDEX")));
> ...
> + errmsg("index access method \"%s\" does not have a handler",
>
> In 3 places, the comment and code should say "COMPRESSION" right ?
Fixed, along with some other refactoring around this code.
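For reference, the corrected comment and messages would presumably read along
these lines (exact wording is a guess at the final form):

/* Check if it's a compression access method as opposed to some other AM */
if (amform->amtype != AMTYPE_COMPRESSION)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("access method \"%s\" is not of type %s",
                    NameStr(amform->amname), "COMPRESSION")));
...
    errmsg("compression access method \"%s\" does not have a handler",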
> Subject: [PATCH v16 6/6] Support compression methods options
>
> + If compression method has options they could be specified with
> + <literal>WITH</literal> parameter.
>
> If *the* compression method has options, they *can* be specified with *the* ...
Done
> @@ -1004,7 +1004,9 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
> + method is <literal>pglz</literal>. If the compression method has options
> + they could be specified by <literal>WITH</literal>
> + parameter.
>
> same
Done
> +static void *
> +lz4_cminitstate(List *options)
> +{
> + int32 *acceleration = palloc(sizeof(int32));
> +
> + /* initialize with the default acceleration */
> + *acceleration = 1;
> +
> + if (list_length(options) > 0)
> + {
> + ListCell *lc;
> +
> + foreach(lc, options)
> + {
> + DefElem *def = (DefElem *) lfirst(lc);
> +
> + if (strcmp(def->defname, "acceleration") == 0)
> + *acceleration = pg_atoi(defGetString(def), sizeof(int32), 0);
>
> Don't you need to say "else: error: unknown compression option" ?
Done
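A sketch of the resulting option loop with the suggested else branch; the
error code and message wording here are assumptions:

static void *
lz4_cminitstate(List *options)
{
    int32      *acceleration = palloc(sizeof(int32));
    ListCell   *lc;

    /* initialize with the default acceleration */
    *acceleration = 1;

    /* foreach is a no-op on an empty list, so no length check is needed */
    foreach(lc, options)
    {
        DefElem    *def = (DefElem *) lfirst(lc);

        if (strcmp(def->defname, "acceleration") == 0)
            *acceleration = pg_atoi(defGetString(def), sizeof(int32), 0);
        else
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("unknown compression option \"%s\"",
                            def->defname)));
    }

    return acceleration;
}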
> + /*
> + * Compression option must be only valid if we are updating the compression
> + * method.
> + */
> + Assert(DatumGetPointer(acoptions) == NULL || OidIsValid(newcompression));
> +
>
> should say "need be valid only if .."
Changed.
Apart from this, I have also done some refactoring and comment improvements.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v17-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 45.9 KB |
v17-0001-Built-in-compression-method.patch | application/octet-stream | 224.9 KB |
v17-0002-alter-table-set-compression.patch | application/octet-stream | 17.8 KB |
v17-0004-Create-custom-compression-methods.patch | application/octet-stream | 31.7 KB |
v17-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 10.0 KB |
v17-0006-Support-compression-methods-options.patch | application/octet-stream | 62.7 KB |
From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-10 17:29:45 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Mon, Jan 04, 2021 at 04:57:16PM +0530, Dilip Kumar wrote:
> On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > And fails pg_upgrade check, apparently losing track of the compression (?)
> >
> > CREATE TABLE public.cmdata2 (
> > - f1 text COMPRESSION lz4
> > + f1 text
> > );
>
> I did not get this? pg_upgrade check is passing for me.
I realized that this was failing in your v16 patch sent Dec 25.
It's passing on current patches because they do "DROP TABLE cmdata2", but
that's only masking the error.
I think this patch needs to be specifically concerned with pg_upgrade, so I
suggest not dropping your tables and MVs, to allow the pg_upgrade test to check
them. That exposes this issue:
pg_dump: error: Error message from server: ERROR: cache lookup failed for access method 36447
pg_dump: error: The command was: COPY public.cmdata (f1) TO stdout;
pg_dumpall: error: pg_dump failed on database "regression", exiting
waiting for server to shut down.... done
server stopped
pg_dumpall of post-upgrade database cluster failed
I found that's the AM's OID in the old cluster:
regression=# SELECT * FROM pg_am WHERE oid=36447;
oid | amname | amhandler | amtype
-------+--------+-------------+--------
36447 | pglz2 | pglzhandler | c
But in the new cluster, the OID has changed. Since that's written into table
data, I think you have to ensure that the compression OIDs are preserved on
upgrade:
16755 | pglz2 | pglzhandler | c
In my brief attempt to inspect it, I got this crash:
$ tmp_install/usr/local/pgsql/bin/postgres -D src/bin/pg_upgrade/tmp_check/data &
regression=# SELECT pg_column_compression(f1) FROM cmdata a;
server closed the connection unexpectedly
Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
120 ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
(gdb) bt
#0 __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
#1 0x000055c6049fde62 in cstring_to_text (s=0x0) at varlena.c:193
#2 pg_column_compression () at varlena.c:5335
(gdb) up
#2 pg_column_compression () at varlena.c:5335
5335 PG_RETURN_TEXT_P(cstring_to_text(get_am_name(
(gdb) l
5333 varvalue = (struct varlena *) DatumGetPointer(value);
5334
5335 PG_RETURN_TEXT_P(cstring_to_text(get_am_name(
5336 toast_get_compression_oid(varvalue))));
I guess a missing AM here is a "shouldn't happen" case, but I'd prefer it to be
caught with an elog() (maybe in get_am_name()) or at least an Assert.
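For instance, a minimal sketch of such a guard at the call site (placement
and message wording are illustrative):

    Oid         cmoid = toast_get_compression_oid(varvalue);
    char       *amname = get_am_name(cmoid);

    if (amname == NULL)
        elog(ERROR, "cache lookup failed for compression access method %u",
             cmoid);

    PG_RETURN_TEXT_P(cstring_to_text(amname));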
--
Justin
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-11 05:30:14 |
Message-ID: | CAFiTN-uOp-P_Kt3NMQ9Vt2Vk4RGkFvjyN_dRbOXca7nydC9m9Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Jan 10, 2021 at 10:59 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> On Mon, Jan 04, 2021 at 04:57:16PM +0530, Dilip Kumar wrote:
> > On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > And fails pg_upgrade check, apparently losing track of the compression (?)
> > >
> > > CREATE TABLE public.cmdata2 (
> > > - f1 text COMPRESSION lz4
> > > + f1 text
> > > );
> >
> > I did not get this? pg_upgrade check is passing for me.
>
> I realized that this was failing in your v16 patch sent Dec 25.
> It's passing on current patches because they do "DROP TABLE cmdata2", but
> that's only masking the error.
>
> I think this patch needs to be specifically concerned with pg_upgrade, so I
> suggest to not drop your tables and MVs, to allow the pg_upgrade test to check
> them. That exposes this issue:
Thanks for the suggestion, I will try this.
> pg_dump: error: Error message from server: ERROR: cache lookup failed for access method 36447
> pg_dump: error: The command was: COPY public.cmdata (f1) TO stdout;
> pg_dumpall: error: pg_dump failed on database "regression", exiting
> waiting for server to shut down.... done
> server stopped
> pg_dumpall of post-upgrade database cluster failed
>
> I found that's the AM's OID in the old cluster:
> regression=# SELECT * FROM pg_am WHERE oid=36447;
> oid | amname | amhandler | amtype
> -------+--------+-------------+--------
> 36447 | pglz2 | pglzhandler | c
>
> But in the new cluster, the OID has changed. Since that's written into table
> data, I think you have to ensure that the compression OIDs are preserved on
> upgrade:
>
> 16755 | pglz2 | pglzhandler | c
Yeah, basically we are storing the AM OID in the compressed data, so the OID
must be preserved. I will look into this and fix it.
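One plausible shape for such a fix, modeled on the existing
binary_upgrade_set_next_* support functions in pg_upgrade_support.c (the
function and variable names below are assumptions):

Datum
binary_upgrade_set_next_pg_am_oid(PG_FUNCTION_ARGS)
{
    Oid         amoid = PG_GETARG_OID(0);

    CHECK_IS_BINARY_UPGRADE;
    /* remembered here, consumed by the next CREATE ACCESS METHOD */
    binary_upgrade_next_pg_am_oid = amoid;

    PG_RETURN_VOID();
}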
> In my brief attempt to inspect it, I got this crash:
>
> $ tmp_install/usr/local/pgsql/bin/postgres -D src/bin/pg_upgrade/tmp_check/data &
> regression=# SELECT pg_column_compression(f1) FROM cmdata a;
> server closed the connection unexpectedly
>
> Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
> __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
> 120 ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
> (gdb) bt
> #0 __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
> #1 0x000055c6049fde62 in cstring_to_text (s=0x0) at varlena.c:193
> #2 pg_column_compression () at varlena.c:5335
>
> (gdb) up
> #2 pg_column_compression () at varlena.c:5335
> 5335 PG_RETURN_TEXT_P(cstring_to_text(get_am_name(
> (gdb) l
> 5333 varvalue = (struct varlena *) DatumGetPointer(value);
> 5334
> 5335 PG_RETURN_TEXT_P(cstring_to_text(get_am_name(
> 5336 toast_get_compression_oid(varvalue))));
>
> I guess a missing AM here is a "shouldn't happen" case, but I'd prefer it to be
> caught with an elog() (maybe in get_am_name()) or at least an Assert.
Yeah, this makes sense.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-11 06:41:54 |
Message-ID: | CAFiTN-shBZoHjb=C5L6j9ddKcn533VUuLrFsX6Z2TCiAiSx+6w@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jan 11, 2021 at 11:00 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sun, Jan 10, 2021 at 10:59 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> >
> > On Mon, Jan 04, 2021 at 04:57:16PM +0530, Dilip Kumar wrote:
> > > On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > And fails pg_upgrade check, apparently losing track of the compression (?)
> > > >
> > > > CREATE TABLE public.cmdata2 (
> > > > - f1 text COMPRESSION lz4
> > > > + f1 text
> > > > );
> > >
> > > I did not get this? pg_upgrade check is passing for me.
> >
> > I realized that this was failing in your v16 patch sent Dec 25.
> > It's passing on current patches because they do "DROP TABLE cmdata2", but
> > that's only masking the error.
I tested pg_upgrade specifically by removing all the DROP TABLE and MV
statements, and it is passing. I don't see the reason why it should fail. I
mean, after the upgrade, why is COMPRESSION lz4 missing?
> > I think this patch needs to be specifically concerned with pg_upgrade, so I
> > suggest to not drop your tables and MVs, to allow the pg_upgrade test to check
> > them. That exposes this issue:
>
> Thanks for the suggestion I will try this.
>
> > pg_dump: error: Error message from server: ERROR: cache lookup failed for access method 36447
> > pg_dump: error: The command was: COPY public.cmdata (f1) TO stdout;
> > pg_dumpall: error: pg_dump failed on database "regression", exiting
> > waiting for server to shut down.... done
> > server stopped
> > pg_dumpall of post-upgrade database cluster failed
> >
> > I found that's the AM's OID in the old cluster:
> > regression=# SELECT * FROM pg_am WHERE oid=36447;
> > oid | amname | amhandler | amtype
> > -------+--------+-------------+--------
> > 36447 | pglz2 | pglzhandler | c
> >
> > But in the new cluster, the OID has changed. Since that's written into table
> > data, I think you have to ensure that the compression OIDs are preserved on
> > upgrade:
> >
> > 16755 | pglz2 | pglzhandler | c
>
> Yeah, basically we are storing am oid in the compressed data so Oid
> must be preserved. I will look into this and fix it.
On further analysis: if we are dumping and restoring then we will
compress the data back while inserting it, so why would we need the old
OID? I mean, in the new cluster we are inserting the data again, so it
will be compressed again and will store the new OID. Am I missing
something here?
> > In my brief attempt to inspect it, I got this crash:
> >
> > $ tmp_install/usr/local/pgsql/bin/postgres -D src/bin/pg_upgrade/tmp_check/data &
> > regression=# SELECT pg_column_compression(f1) FROM cmdata a;
> > server closed the connection unexpectedly
I tried to test this after the upgrade, but I get the proper value.
Laptop309pnin:bin dilipkumar$ ./pg_ctl -D
/Users/dilipkumar/Documents/PG/custom_compression/src/bin/pg_upgrade/tmp_check/data.old/
start
waiting for server to start....2021-01-11 11:53:28.153 IST [43412]
LOG: starting PostgreSQL 14devel on x86_64-apple-darwin19.6.0,
compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
2021-01-11 11:53:28.170 IST [43412] LOG: database system is ready to
accept connections
done
server started
Laptop309pnin:bin dilipkumar$ ./psql -d regression
regression[43421]=# SELECT pg_column_compression(f1) FROM cmdata a;
pg_column_compression
-----------------------
lz4
lz4
pglz2
(3 rows)
Manual test: (dump and load on the new cluster)
---------------
postgres[43903]=# CREATE ACCESS METHOD pglz2 TYPE COMPRESSION HANDLER
pglzhandler;
CREATE ACCESS METHOD
postgres[43903]=# select oid from pg_am where amname='pglz2';
oid
-------
16384
(1 row)
postgres[43903]=# CREATE TABLE cmdata_test(f1 text COMPRESSION pglz2);
CREATE TABLE
postgres[43903]=# INSERT INTO cmdata_test
VALUES(repeat('1234567890',1000));
INSERT 0 1
postgres[43903]=# SELECT pg_column_compression(f1) FROM cmdata_test;
pg_column_compression
-----------------------
pglz2
(1 row)
Laptop309pnin:bin dilipkumar$ ./pg_dump -d postgres > 1.sql
—restore on new cluster—
postgres[44030]=# select oid from pg_am where amname='pglz2';
oid
-------
16385
(1 row)
postgres[44030]=# SELECT pg_column_compression(f1) FROM cmdata_test;
pg_column_compression
-----------------------
pglz2
(1 row)
You can see that on the new cluster the OID of pglz2 has changed but
there is no issue. Is it possible for you to give me a
self-contained test case to reproduce the issue, or a theory as to why
it should fail?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-11 06:51:16 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Mon, Jan 11, 2021 at 12:11:54PM +0530, Dilip Kumar wrote:
> On Mon, Jan 11, 2021 at 11:00 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > On Sun, Jan 10, 2021 at 10:59 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > >
> > > On Mon, Jan 04, 2021 at 04:57:16PM +0530, Dilip Kumar wrote:
> > > > On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > > And fails pg_upgrade check, apparently losing track of the compression (?)
> > > > >
> > > > > CREATE TABLE public.cmdata2 (
> > > > > - f1 text COMPRESSION lz4
> > > > > + f1 text
> > > > > );
> > > >
> > > > I did not get this? pg_upgrade check is passing for me.
> > >
> > > I realized that this was failing in your v16 patch sent Dec 25.
> > > It's passing on current patches because they do "DROP TABLE cmdata2", but
> > > that's only masking the error.
>
> I tested specifically pg_upgrade by removing all the DROP table and MV
> and it is passing. I don't see the reason why should it fail. I mean
> after the upgrade why COMPRESSION lz4 is missing?
How did you test it ?
I'm not completely clear how this is intended to work... has it been tested
before ? According to the comments, in binary upgrade mode, there's an ALTER
which is supposed to SET COMPRESSION, but that's evidently not happening.
> > > I found that's the AM's OID in the old cluster:
> > > regression=# SELECT * FROM pg_am WHERE oid=36447;
> > > oid | amname | amhandler | amtype
> > > -------+--------+-------------+--------
> > > 36447 | pglz2 | pglzhandler | c
> > >
> > > But in the new cluster, the OID has changed. Since that's written into table
> > > data, I think you have to ensure that the compression OIDs are preserved on
> > > upgrade:
> > >
> > > 16755 | pglz2 | pglzhandler | c
> >
> > Yeah, basically we are storing am oid in the compressed data so Oid
> > must be preserved. I will look into this and fix it.
>
> On further analysis, if we are dumping and restoring then we will
> compress the data back while inserting it so why would we need to old
> OID. I mean in the new cluster we are inserting data again so it will
> be compressed again and now it will store the new OID. Am I missing
> something here?
I'm referring to pg_upgrade, which uses pg_dump but does *not* re-insert data;
rather, it recreates only the catalogs and then links to the old tables (either
with copy, link, or clone). Test with make -C src/bin/pg_upgrade (which is
included in make check-world).
--
Justin
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-11 10:10:47 |
Message-ID: | CAFiTN-vzRz+msFYHhi6ysVAOvg5E2o9YRbEJS41zNFmkxTo+=w@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jan 11, 2021 at 12:21 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> On Mon, Jan 11, 2021 at 12:11:54PM +0530, Dilip Kumar wrote:
> > On Mon, Jan 11, 2021 at 11:00 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > On Sun, Jan 10, 2021 at 10:59 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > >
> > > > On Mon, Jan 04, 2021 at 04:57:16PM +0530, Dilip Kumar wrote:
> > > > > On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > > > And fails pg_upgrade check, apparently losing track of the compression (?)
> > > > > >
> > > > > > CREATE TABLE public.cmdata2 (
> > > > > > - f1 text COMPRESSION lz4
> > > > > > + f1 text
> > > > > > );
> > > > >
> > > > > I did not get this? pg_upgrade check is passing for me.
> > > >
> > > > I realized that this was failing in your v16 patch sent Dec 25.
> > > > It's passing on current patches because they do "DROP TABLE cmdata2", but
> > > > that's only masking the error.
> >
> > I tested specifically pg_upgrade by removing all the DROP table and MV
> > and it is passing. I don't see the reason why should it fail. I mean
> > after the upgrade why COMPRESSION lz4 is missing?
>
> How did you test it ?
>
> I'm not completely clear how this is intended to work... has it been tested
> before ? According to the comments, in binary upgrade mode, there's an ALTER
> which is supposed to SET COMPRESSION, but that's evidently not happening.
I am able to reproduce this issue: if I run pg_dump in
binary-upgrade mode then I can see it (./pg_dump
--binary-upgrade -d Postgres). Yes, you are right that to fix
this there should be an ALTER..SET COMPRESSION command.
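A sketch of the kind of pg_dump change implied here, in the style of
dumpTableSchema(); the attcompression field and its type are assumptions
about the patch's internals:

    /* hypothetical: in binary-upgrade mode, pin each column's method */
    if (dopt->binary_upgrade && tbinfo->attcompression[j] != NULL)
        appendPQExpBuffer(q,
                          "ALTER TABLE ONLY %s ALTER COLUMN %s SET COMPRESSION %s;\n",
                          qualrelname,
                          fmtId(tbinfo->attnames[j]),
                          tbinfo->attcompression[j]);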
> > > > I found that's the AM's OID in the old cluster:
> > > > regression=# SELECT * FROM pg_am WHERE oid=36447;
> > > > oid | amname | amhandler | amtype
> > > > -------+--------+-------------+--------
> > > > 36447 | pglz2 | pglzhandler | c
> > > >
> > > > But in the new cluster, the OID has changed. Since that's written into table
> > > > data, I think you have to ensure that the compression OIDs are preserved on
> > > > upgrade:
> > > >
> > > > 16755 | pglz2 | pglzhandler | c
> > >
> > > Yeah, basically we are storing am oid in the compressed data so Oid
> > > must be preserved. I will look into this and fix it.
> >
> > On further analysis, if we are dumping and restoring then we will
> > compress the data back while inserting it so why would we need to old
> > OID. I mean in the new cluster we are inserting data again so it will
> > be compressed again and now it will store the new OID. Am I missing
> > something here?
>
> I'm referring to pg_upgrade which uses pg_dump, but does *not* re-insert data,
> but rather recreates catalogs only and then links to the old tables (either
> with copy, link, or clone). Test with make -C src/bin/pg_upgrade (which is
> included in make check-world).
Got this as well.
I will fix these two issues and post the updated patch by tomorrow.
Thanks for your findings.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-13 08:44:01 |
Message-ID: | CAFiTN-uGY6fp9p0dpdUdodW4x7fzYFuEzRpn8fsweznivAqSvQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jan 11, 2021 at 3:40 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Jan 11, 2021 at 12:21 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> >
> > On Mon, Jan 11, 2021 at 12:11:54PM +0530, Dilip Kumar wrote:
> > > On Mon, Jan 11, 2021 at 11:00 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > On Sun, Jan 10, 2021 at 10:59 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > >
> > > > > On Mon, Jan 04, 2021 at 04:57:16PM +0530, Dilip Kumar wrote:
> > > > > > On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > > > > And fails pg_upgrade check, apparently losing track of the compression (?)
> > > > > > >
> > > > > > > CREATE TABLE public.cmdata2 (
> > > > > > > - f1 text COMPRESSION lz4
> > > > > > > + f1 text
> > > > > > > );
> > > > > >
> > > > > > I did not get this? pg_upgrade check is passing for me.
> > > > >
> > > > > I realized that this was failing in your v16 patch sent Dec 25.
> > > > > It's passing on current patches because they do "DROP TABLE cmdata2", but
> > > > > that's only masking the error.
> > >
> > > I tested specifically pg_upgrade by removing all the DROP table and MV
> > > and it is passing. I don't see the reason why should it fail. I mean
> > > after the upgrade why COMPRESSION lz4 is missing?
> >
> > How did you test it ?
> >
> > I'm not completely clear how this is intended to work... has it been tested
> > before ? According to the comments, in binary upgrade mode, there's an ALTER
> > which is supposed to SET COMPRESSION, but that's evidently not happening.
>
> I am able to reproduce this issue, If I run pg_dump with
> binary_upgrade mode then I can see the issue (./pg_dump
> --binary-upgrade -d Postgres). Yes you are right that for fixing
> this there should be an ALTER..SET COMPRESSION method.
>
> > > > > I found that's the AM's OID in the old cluster:
> > > > > regression=# SELECT * FROM pg_am WHERE oid=36447;
> > > > > oid | amname | amhandler | amtype
> > > > > -------+--------+-------------+--------
> > > > > 36447 | pglz2 | pglzhandler | c
> > > > >
> > > > > But in the new cluster, the OID has changed. Since that's written into table
> > > > > data, I think you have to ensure that the compression OIDs are preserved on
> > > > > upgrade:
> > > > >
> > > > > 16755 | pglz2 | pglzhandler | c
> > > >
> > > > Yeah, basically we are storing am oid in the compressed data so Oid
> > > > must be preserved. I will look into this and fix it.
> > >
> > > On further analysis, if we are dumping and restoring then we will
> > > compress the data back while inserting it so why would we need to old
> > > OID. I mean in the new cluster we are inserting data again so it will
> > > be compressed again and now it will store the new OID. Am I missing
> > > something here?
> >
> > I'm referring to pg_upgrade which uses pg_dump, but does *not* re-insert data,
> > but rather recreates catalogs only and then links to the old tables (either
> > with copy, link, or clone). Test with make -C src/bin/pg_upgrade (which is
> > included in make check-world).
>
> Got this as well.
>
> I will fix these two issues and post the updated patch by tomorrow.
>
> Thanks for your findings.
I have fixed this issue in the v18 version; please test and let me
know your thoughts. There is one more issue pending from an upgrade
perspective in v18-0003: basically, for the preserved methods we need
to restore the dependencies as well. I will work on this part and
share the next version soon.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v18-0002-alter-table-set-compression.patch | application/octet-stream | 17.8 KB |
v18-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 46.0 KB |
v18-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 10.0 KB |
v18-0004-Create-custom-compression-methods.patch | application/octet-stream | 31.1 KB |
v18-0001-Built-in-compression-method.patch | application/octet-stream | 228.0 KB |
v18-0006-Support-compression-methods-options.patch | application/octet-stream | 61.6 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] Custom compression methods |
Date: | 2021-01-19 09:53:55 |
Message-ID: | CAFiTN-tbLA1QYNcxD9Ryes_h25uuac7+oO_f4mqMfdMAxdzttA@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jan 13, 2021 at 2:14 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Jan 11, 2021 at 3:40 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 11, 2021 at 12:21 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 12:11:54PM +0530, Dilip Kumar wrote:
> > > > On Mon, Jan 11, 2021 at 11:00 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > > On Sun, Jan 10, 2021 at 10:59 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > > >
> > > > > > On Mon, Jan 04, 2021 at 04:57:16PM +0530, Dilip Kumar wrote:
> > > > > > > On Mon, Jan 4, 2021 at 6:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > > > > > And fails pg_upgrade check, apparently losing track of the compression (?)
> > > > > > > >
> > > > > > > > CREATE TABLE public.cmdata2 (
> > > > > > > > - f1 text COMPRESSION lz4
> > > > > > > > + f1 text
> > > > > > > > );
> > > > > > >
> > > > > > > I did not get this? pg_upgrade check is passing for me.
> > > > > >
> > > > > > I realized that this was failing in your v16 patch sent Dec 25.
> > > > > > It's passing on current patches because they do "DROP TABLE cmdata2", but
> > > > > > that's only masking the error.
> > > >
> > > > I tested specifically pg_upgrade by removing all the DROP table and MV
> > > > and it is passing. I don't see the reason why should it fail. I mean
> > > > after the upgrade why COMPRESSION lz4 is missing?
> > >
> > > How did you test it ?
> > >
> > > I'm not completely clear how this is intended to work... has it been tested
> > > before ? According to the comments, in binary upgrade mode, there's an ALTER
> > > which is supposed to SET COMPRESSION, but that's evidently not happening.
> >
> > I am able to reproduce this issue, If I run pg_dump with
> > binary_upgrade mode then I can see the issue (./pg_dump
> > --binary-upgrade -d Postgres). Yes you are right that for fixing
> > this there should be an ALTER..SET COMPRESSION method.
> >
> > > > > > I found that's the AM's OID in the old cluster:
> > > > > > regression=# SELECT * FROM pg_am WHERE oid=36447;
> > > > > > oid | amname | amhandler | amtype
> > > > > > -------+--------+-------------+--------
> > > > > > 36447 | pglz2 | pglzhandler | c
> > > > > >
> > > > > > But in the new cluster, the OID has changed. Since that's written into table
> > > > > > data, I think you have to ensure that the compression OIDs are preserved on
> > > > > > upgrade:
> > > > > >
> > > > > > 16755 | pglz2 | pglzhandler | c
> > > > >
> > > > > Yeah, basically we are storing am oid in the compressed data so Oid
> > > > > must be preserved. I will look into this and fix it.
> > > >
> > > > On further analysis, if we are dumping and restoring then we will
> > > > compress the data back while inserting it so why would we need to old
> > > > OID. I mean in the new cluster we are inserting data again so it will
> > > > be compressed again and now it will store the new OID. Am I missing
> > > > something here?
> > >
> > > I'm referring to pg_upgrade which uses pg_dump, but does *not* re-insert data,
> > > but rather recreates catalogs only and then links to the old tables (either
> > > with copy, link, or clone). Test with make -C src/bin/pg_upgrade (which is
> > > included in make check-world).
> >
> > Got this as well.
> >
> > I will fix these two issues and post the updated patch by tomorrow.
> >
> > Thanks for your findings.
>
> I have fixed this issue in the v18 version, please test and let me
> know your thoughts. There is one more issue pending from an upgrade
> perspective in v18-0003, basically, for the preserved method we need
> to restore the dependency as well. I will work on this part and
> shared the next version soon.
Now I have added support for handling the preserved methods in binary
upgrade mode; please find the updated patch set.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v19-0002-alter-table-set-compression.patch | application/octet-stream | 17.8 KB |
v19-0004-Create-custom-compression-methods.patch | application/octet-stream | 31.1 KB |
v19-0003-Add-support-for-PRESERVE.patch | application/octet-stream | 51.2 KB |
v19-0005-new-compression-method-extension-for-zlib.patch | application/octet-stream | 10.0 KB |
v19-0001-Built-in-compression-method.patch | application/octet-stream | 228.0 KB |
v19-0006-Support-compression-methods-options.patch | application/octet-stream | 61.7 KB |