From: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-19 23:31:07
Message-ID: [email protected]
Lists: pgsql-committers
Improve runtime and output of tests for replication slots checkpointing.
The TAP tests that verify logical and physical replication slot behavior
during checkpoints (046_checkpoint_logical_slot.pl and
047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows each,
generating approximately 520 MB of WAL. On slow machines, or when compiled
with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
tests to run for 8-9 minutes and occasionally time out, as seen on the
buildfarm animal prion.
This commit modifies the mentioned tests to use the $node->advance_wal()
function, thereby reducing runtime. Since the tests never actually consume the
generated data, advance_wal() is a good alternative that cuts the total
wall-clock run time.
While here, remove superfluous '\n' characters from several note() calls;
these appeared literally in the buildfarm logs and looked odd. Also, remove
the unnecessary 'shared_preload_libraries' GUC from the config and add a check
for 'injection_points' extension availability.
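For illustration only, a minimal sketch of the kind of change described above
(the table name, row count, and segment count are hypothetical, not taken from
the actual patch):

# Before: insert millions of rows only to generate WAL that the test never reads back:
# $node->safe_psql('postgres',
#     q{insert into foo select i from generate_series(1, 2000000) i});

# After: advance WAL by a few segments instead (segment count is illustrative).
$node->advance_wal(10);

# note() output goes to the TAP log; a trailing "\n" would show up literally there.
note('advanced WAL without bulk inserts');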
Reported-by: Alexander Lakhin <exclusion(at)gmail(dot)com>
Reported-by: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Author: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Author: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
Reviewed-by: Hayato Kuroda <kuroda(dot)hayato(at)fujitsu(dot)com>
Discussion: https://postgr.es/m/fbc5d94e-6fbd-4a64-85d4-c9e284a58eb2%40gmail.com
Backpatch-through: 17
Branch
------
REL_17_STABLE
Modified Files
--------------
src/test/recovery/t/046_checkpoint_logical_slot.pl | 34 +++++++++++-----------
.../recovery/t/047_checkpoint_physical_slot.pl | 26 +++++++++--------
2 files changed, 31 insertions(+), 29 deletions(-)
From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
Cc: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 16:10:07
Message-ID: CAAKRu_ZCOzQpEumLFgG_+iw3FTa+hJ4SRpxzaQBYxxM_ZAzWcA@mail.gmail.com
Lists: pgsql-committers
On Thu, Jun 19, 2025 at 7:31 PM Alexander Korotkov <akorotkov(at)postgresql(dot)org>
wrote:
> Improve runtime and output of tests for replication slots checkpointing.
>
> The TAP tests that verify logical and physical replication slot behavior
> during checkpoints (046_checkpoint_logical_slot.pl and
> 047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows
> each,
> generating approximately 520 MB of WAL. On slow machines, or when compiled
> with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
> tests to run for 8-9 minutes and occasionally time out, as seen on the
> buildfarm animal prion.
Quite a few animals have started failing since this commit (for example
[1]). I haven't looked into why, but I suspect something is wrong.
stderr:
# Failed test 'Logical slot still valid'
# at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/046_checkpoint_logical_slot.pl
line 134.
# got: 'death by signal at
/home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 181.
# '
# expected: ''
I see this failing on my fork's CI, so it seems like it could have been
caught earlier?
- Melanie
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers <pgsql-committers(at)lists(dot)postgresql(dot)org>
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 17:04:37
Message-ID: CAPpHfduq==JJP6f2eKLxvKXBqLJ25QYUJkACDGwPSA3diC_AEg@mail.gmail.com
Lists: pgsql-committers
On Fri, Jun 20, 2025, 19:10 Melanie Plageman <melanieplageman(at)gmail(dot)com>
wrote:
>
> On Thu, Jun 19, 2025 at 7:31 PM Alexander Korotkov <
> akorotkov(at)postgresql(dot)org> wrote:
>
>> Improve runtime and output of tests for replication slots checkpointing.
>>
>> The TAP tests that verify logical and physical replication slot behavior
>> during checkpoints (046_checkpoint_logical_slot.pl and
>> 047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows
>> each,
>> generating approximately 520 MB of WAL. On slow machines, or when
>> compiled
>> with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
>> tests to run for 8-9 minutes and occasionally time out, as seen on the
>> buildfarm animal prion.
>
>
> Quite a few animals have started failing since this commit (for example
> [1]) . I haven't looked into why, but I suspect something is wrong.
>
> stderr:
> # Failed test 'Logical slot still valid'
> # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/046_checkpoint_logical_slot.pl line 134.
> # got: 'death by signal at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
> # '
> # expected: ''
>
> I see this failing on my fork's CI, so it seems like it could have been
> caught earlier?
>
Melanie, thank you for pointing this out. I'll take a look at it later today.
------
Regards,
Alexander Korotkov
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 22:03:05
Message-ID: [email protected]
Lists: pgsql-committers
Melanie Plageman <melanieplageman(at)gmail(dot)com> writes:
> Quite a few animals have started failing since this commit (for example
> [1]) . I haven't looked into why, but I suspect something is wrong.
It looks to me like it's being triggered by this questionable bit in
046_checkpoint_logical_slot.pl:
# Continue the checkpoint.
$node->safe_psql('postgres',
q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
# Abruptly stop the server (1 second should be enough for the checkpoint
# to finish; it would be better).
$node->stop('immediate');
That second comment is pretty unintelligible, but I think it's
expecting that we'd give the checkpoint 1 second to complete,
which the code is *not* doing. On my own machine it looks like the
checkpoint does manage to complete within about 1ms, just barely
before the shutdown arrives:
2025-06-20 17:52:25.599 EDT [2538690] 046_checkpoint_logical_slot.pl LOG: statement: select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())
2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
2025-06-20 17:52:25.603 EDT [2538557] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=1.074 s; sync files=0, longest=0.000 s, average=0.000 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
2025-06-20 17:52:25.604 EDT [2538553] LOG: received immediate shutdown request
But in the buildfarm failures I don't see any 'checkpoint complete'
before the shutdown.
If this is an accurate diagnosis then it indicates both a test bug
(it should delay here, or else the comment needs to be fixed to explain
what we're actually testing) and a backend bug, because an immediate
stop a/k/a crash before completing the checkpoint should not lead to
failure to function after the next restart.
regards, tom lane
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 22:24:58
Message-ID: [email protected]
Lists: pgsql-committers
I wrote:
> But in the buildfarm failures I don't see any 'checkpoint complete'
> before the shutdown.
Ooops, I lied: we have at least one case where the checkpoint does
finish but then it hangs up anyway:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG: disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG: disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG: statement: SELECT 1
2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received immediate shutdown request
2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database system is shut down
I'm still confused about whether the 046 script intends to
sometimes test cases where the checkpoint doesn't have time to
complete. But it seems that whatever is going on here
is a bit subtle and platform-dependent.
regards, tom lane
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 22:40:46
Message-ID: CAPpHfdsedYNmKDnDuueJor2k8kA+hgEJKFJRf4PhDowdP_A4Bg@mail.gmail.com
Lists: pgsql-committers
On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
> > But in the buildfarm failures I don't see any 'checkpoint complete'
> > before the shutdown.
>
> Ooops, I lied: we have at least one case where the checkpoint does
> finish but then it hangs up anyway:
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
>
> 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG: disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG: disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG: statement: SELECT 1
> 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received immediate shutdown request
> 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database system is shut down
>
> I'm still confused about whether the 046 script intends to
> sometimes test cases where the checkpoint doesn't have time to
> complete. But it seems that whatever is going on here
> is a bit subtle and platform-dependent.
I think this indicates an unfinished intention to wait for checkpoint
completion. But I think both cases (checkpoint finished and
unfinished) should work correctly. So, I believe there is a backend
problem. I'm trying to reproduce this locally. Sorry for the
confusion.
------
Regards,
Alexander Korotkov
Supabase
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 23:28:32
Message-ID: CAPpHfdtmLNpswWzk2qQ7-nMBLod2GrRWwMvfb0hc7tQfDD--ZQ@mail.gmail.com
Lists: pgsql-committers
On Sat, Jun 21, 2025 at 1:40 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:
> On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > I wrote:
> > > But in the buildfarm failures I don't see any 'checkpoint complete'
> > > before the shutdown.
> >
> > Ooops, I lied: we have at least one case where the checkpoint does
> > finish but then it hangs up anyway:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
> >
> > 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> > 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> > 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG: disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG: disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG: statement: SELECT 1
> > 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received immediate shutdown request
> > 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database system is shut down
> >
> > I'm still confused about whether the 046 script intends to
> > sometimes test cases where the checkpoint doesn't have time to
> > complete. But it seems that whatever is going on here
> > is a bit subtle and platform-dependent.
>
> I think this indicates unfinished intention to wait for checkpoint
> completion. But I think both cases (checkpoint finished and
> unfinished) should work correctly. So, I believe there is a backend
> problem. I'm trying to reproduce this locally. Sorry for the
> confusion.
I managed to reproduce the bug locally. I've got the following backtrace.
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000019e6fd3c8 libsystem_kernel.dylib`__semwait_signal + 8
frame #1: 0x000000019e5de568 libsystem_c.dylib`nanosleep + 220
frame #2: 0x0000000103009628 postgres`pg_usleep(microsec=1000) at pgsleep.c:50:10
frame #3: 0x00000001027bd248 postgres`read_local_xlog_page_guts(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4
frame #4: 0x00000001027bd130 postgres`read_local_xlog_page(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005") at xlogutils.c:848:9
frame #5: 0x00000001027ae184 postgres`ReadPageInternal(state=0x000000011383ed58, pageptr=687882240, reqLen=277) at xlogreader.c:1061:12
frame #6: 0x00000001027acd14 postgres`XLogDecodeNextRecord(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:727:14
frame #7: 0x00000001027ac6bc postgres`XLogReadAhead(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:974:11
frame #8: 0x00000001027ac5c0 postgres`XLogReadRecord(state=0x000000011383ed58, errormsg=0x000000016d798dd0) at xlogreader.c:405:3
frame #9: 0x0000000102c1d1d0 postgres`pg_logical_slot_get_changes_guts(fcinfo=0x0000000113830700, confirm=true, binary=false) at logicalfuncs.c:257:13
frame #10: 0x0000000102c1ca00 postgres`pg_logical_slot_get_changes(fcinfo=0x0000000113830700) at logicalfuncs.c:333:9
frame #11: 0x00000001029e1adc postgres`ExecMakeTableFunctionResult(setexpr=0x0000000113825440, econtext=0x00000001138252e0, argContext=0x0000000113830600, expectedDesc=0x000000011382ecb0, randomAccess=false) at execSRF.c:234:13
While looking at frame 5 I found the following code to be executed.
targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
targetRecOff = RecPtr % XLOG_BLCKSZ;
/*
* Read the page containing the record into state->readBuf. Request enough
* byte to cover the whole record header, or at least the part of it that
* fits on the same page.
*/
readOff = ReadPageInternal(state, targetPagePtr,
Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
And I see the following variable values.
(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138
I hardly understand how this is possible, given it was compiled with "-O0".
I'm planning to continue investigating this tomorrow.
------
Regards,
Alexander Korotkov
Supabase
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 23:30:58
Message-ID: [email protected]
Lists: pgsql-committers
Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> I think this indicates unfinished intention to wait for checkpoint
> completion. But I think both cases (checkpoint finished and
> unfinished) should work correctly. So, I believe there is a backend
> problem. I'm trying to reproduce this locally. Sorry for the
> confusion.
I've found it not hugely difficult to reproduce: on two different
Mac M4 machines, repeated runs of 046_checkpoint_logical_slot.pl
hung up after 30 or so tries. Backtrace of the stuck query
looks like
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00000001836b01c8 libsystem_kernel.dylib`__semwait_signal + 8
frame #1: 0x000000018358c6f4 libsystem_c.dylib`nanosleep + 220
frame #2: 0x000000010070d358 postgres`pg_usleep(microsec=1000) at pgsleep.c:50:10 [opt]
frame #3: 0x00000001002a0508 postgres`read_local_xlog_page_guts(state=0x0000000126057298, targetPagePtr=687882240, reqLen=277, targetRecPtr=<unavailable>, cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4 [opt]
frame #4: 0x0000000100298f0c postgres`ReadPageInternal(state=0x0000000126057298, pageptr=687882240, reqLen=277) at xlogreader.c:1061:12 [opt]
frame #5: 0x0000000100298300 postgres`XLogReadAhead [inlined] XLogDecodeNextRecord(state=0x0000000126057298, nonblocking=<unavailable>) at xlogreader.c:727:14 [opt]
frame #6: 0x00000001002980b4 postgres`XLogReadAhead(state=0x0000000126057298, nonblocking=false) at xlogreader.c:974:11 [opt]
frame #7: 0x0000000100297fd4 postgres`XLogReadRecord(state=0x0000000126057298, errormsg=0x000000016fc24d08) at xlogreader.c:405:3 [opt]
frame #8: 0x00000001004ee514 postgres`pg_logical_slot_get_changes_guts(fcinfo=<unavailable>, confirm=true, binary=<unavailable>) at logicalfuncs.c:257:13 [opt]
frame #9: 0x00000001004ee1e8 postgres`pg_logical_slot_get_changes(fcinfo=<unavailable>) at logicalfuncs.c:333:9 [opt]
frame #10: 0x00000001003a77ec postgres`ExecMakeTableFunctionResult(setexpr=0x000000012603da40, econtext=<unavailable>, argContext=<unavailable>, expectedDesc=0x00000001260472b0, randomAccess=false) at execSRF.c:234:13 [opt]
frame #11: 0x00000001003ba02c postgres`FunctionNext(node=0x000000012603d6d0) at nodeFunctionscan.c:94:5 [opt]
frame #12: 0x00000001003b4894 postgres`fetch_input_tuple [inlined] ExecProcNode(node=0x000000012603d6d0) at executor.h:313:9 [opt]
frame #13: 0x00000001003b4878 postgres`fetch_input_tuple(aggstate=0x000000012603cf28) at nodeAgg.c:563:10 [opt]
frame #14: 0x00000001003b2eb4 postgres`ExecAgg [inlined] agg_retrieve_direct(aggstate=0x000000012603cf28) at nodeAgg.c:2450:17 [opt]
frame #15: 0x00000001003b2ab8 postgres`ExecAgg(pstate=0x000000012603cf28) at nodeAgg.c:2265:14 [opt]
so apparently it's waiting patiently for WAL that will never appear.
(I do see the 'checkpoint complete' log message before the shutdown
in this particular run.) I still have this debugger session open
if there's any particular data you'd like me to look at.
regards, tom lane
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 23:42:38
Message-ID: [email protected]
Lists: pgsql-committers
Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> And I see the following variable values.
> (lldb) p/x targetPagePtr
> (XLogRecPtr) 0x0000000029004000
> (lldb) p/x RecPtr
> (XLogRecPtr) 0x0000000029002138
> I hardly understand how is this possible given it was compiled with "-O0".
> I'm planning to continue investigating this tomorrow.
Yeah, I see
(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138
(lldb) p/x RecPtr - (RecPtr % 8192)
(XLogRecPtr) 0x0000000029002000
We're here:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
/* Wait for the next page to become available */
readOff = ReadPageInternal(state, targetPagePtr,
Min(total_len - gotlen + SizeOfXLogShortPHD,
XLOG_BLCKSZ));
so that's where the increment of targetPagePtr came from.
But "Wait for the next page to become available" seems awfully
trusting that there will be another page. Should this be
using the no-wait code path?
regards, tom lane
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-22 00:16:10
Message-ID: CAPpHfdurV-j_e0pb=UFENAy3tyzxfF+yHveNDNQk2gM82WBU5A@mail.gmail.com
Lists: pgsql-committers
On Sat, Jun 21, 2025 at 2:42 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > And I see the following variable values.
>
> > (lldb) p/x targetPagePtr
> > (XLogRecPtr) 0x0000000029004000
> > (lldb) p/x RecPtr
> > (XLogRecPtr) 0x0000000029002138
>
> > I hardly understand how is this possible given it was compiled with "-O0".
> > I'm planning to continue investigating this tomorrow.
>
> Yeah, I see
>
> (lldb) p/x targetPagePtr
> (XLogRecPtr) 0x0000000029004000
> (lldb) p/x RecPtr
> (XLogRecPtr) 0x0000000029002138
> (lldb) p/x RecPtr - (RecPtr % 8192)
> (XLogRecPtr) 0x0000000029002000
>
> We're here:
>
> /* Calculate pointer to beginning of next page */
> targetPagePtr += XLOG_BLCKSZ;
>
> /* Wait for the next page to become available */
> readOff = ReadPageInternal(state, targetPagePtr,
> Min(total_len - gotlen + SizeOfXLogShortPHD,
> XLOG_BLCKSZ));
>
> so that's where the increment of targetPagePtr came from.
> But "Wait for the next page to become available" seems awfully
> trusting that there will be another page. Should this be
> using the no-wait code path?
Thank you for the help. It seems to me that the problem is deeper. The
code only seems to be trying to read to the end of the given WAL record,
but it can't reach it. According to the values I've seen in XLogCtl, it
seems that RedoRecPtr points somewhere inside that record's body.
I don't feel confident that I understand what's going on or how to
fix it.
I've tried two things.
1) slot_tests_wait_for_checkpoint.patch
Make the tests wait for checkpoint completion (as I think they were
originally intended to). However, the problem still persists.
2) revert_slot_last_saved_restart_lsn.patch
Revert ca307d5cec90 and make the new tests reserve WAL using the wal_keep_size
GUC. The problem still persists. It seems to be a problem independent
of my attempts to fix retaining WAL files with the slot's
restart_lsn. The new tests just spotted the existing bug.
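(For reference, the WAL reservation in option 2 amounts to a config tweak in
the test; a rough sketch only, with an arbitrary size, not the actual patch:)

# Keep WAL segments around independently of slot state, so the records
# the test later decodes cannot be recycled (the size here is arbitrary).
$node->append_conf('postgresql.conf', 'wal_keep_size = 1GB');
$node->reload;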
------
Regards,
Alexander Korotkov
Supabase
Attachments:
  slot_tests_wait_for_checkpoint.patch (application/octet-stream, 2.1 KB)
  revert_slot_last_saved_restart_lsn.patch (application/octet-stream, 6.9 KB)
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-22 00:56:50
Message-ID: [email protected]
Lists: pgsql-committers
Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Sat, Jun 21, 2025 at 2:42 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> But "Wait for the next page to become available" seems awfully
>> trusting that there will be another page. Should this be
>> using the no-wait code path?
> Thank you for the help. It seems to me that problem is deeper. The
> code seems to only trying to read till the end of given WAL record,
> but can't reach it. According to the values I've seen in XLogCtl, it
> seems that RedoRecPtr points somewhere inside of that record's body.
> I don't feel confident about to understand what's going on and how to
> fix it.
Hmm. My theory about what's happening is that we are writing a WAL
record that spans across a page boundary, and the asynchronous
immediate-stop request comes in and kills that operation, so that
the first half of the record is on disk but the second is not.
(This outcome would obviously be very timing-dependent.) Then
we restart the server, and crash recovery copes with that situation
successfully. But pg_logical_slot_get_changes() fails to, because
it's coded in such a way that it will wait forever for the second
half of the WAL record to appear.
The main hole in this theory is: if crash recovery thought that the
partial record was invalid, wouldn't it set the end-of-WAL position
to the start of that record? And shouldn't that stop
pg_logical_slot_get_changes() from trying to process the partial
record? So I'm sure there's some details not quite right about
this picture. But maybe it's within hailing distance of the truth.
> The new tests just spotted the existing bug.
Yeah, that's what I think too. The unintentional omission of a
pre-shutdown delay in the 046 test has exposed some pre-existing
fragility in pg_logical_slot_get_changes(). So I'm not in favor
of changing 046 till we understand this better.
regards, tom lane
From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 00:28:53
Message-ID: [email protected]
Lists: pgsql-committers
On Sat, Jun 21, 2025 at 08:56:50PM -0400, Tom Lane wrote:
> Hmm. My theory about what's happening is that we are writing a WAL
> record that spans across a page boundary, and the asynchronous
> immediate-stop request comes in and kills that operation, so that
> the first half of the record is on disk but the second is not.
> (This outcome would obviously be very timing-dependent.) Then
> we restart the server, and crash recovery copes with that situation
> successfully. But pg_logical_slot_get_changes() fails to, because
> it's coded in such a way that it will wait forever for the second
> half of the WAL record to appear.
>
> The main hole in this theory is: if crash recovery thought that the
> partial record was invalid, wouldn't it set the end-of-WAL position
> to the start of that record? And shouldn't that stop
> pg_logical_slot_get_changes() from trying to process the partial
> record? So I'm sure there's some details not quite right about
> this picture. But maybe it's within hailing distance of the truth.
Obviously pg_logical_slot_get_changes() should not try to read the
record if it's not complete yet. XLogDecodeNextRecord() may be
trusting the LSN given by its caller too much, though.
>> The new tests just spotted the existing bug.
>
> Yeah, that's what I think too. The unintentional omission of a
> pre-shutdown delay in the 046 test has exposed some pre-existing
> fragility in pg_logical_slot_get_changes(). So I'm not in favor
> of changing 046 till we understand this better.
Yes, better to understand what's going on before changing the test,
and perhaps change 046 so that it provides stable coverage for this case,
even if it was discovered accidentally.
So, is it only pg_logical_slot_get_changes_guts() that's messed up in
this context, because XLogDecodeNextRecord() is trying to read pages
with logic that is too permissive? How did any of you reproduce the
failure? Just by running the test in a loop? It is one of those
patterns where a hardcoded sleep should do the trick and help with a
bisect.
By the way, at the bottom of test 046_checkpoint_logical_slot.pl, if
you expect the checkpoint to complete before sending the immediate
shutdown request, I would suggest using a wait_for_log() based on
"checkpoint complete" or an equivalent loop. What you are doing in
the test is unstable as written.
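For instance, such a wait might look roughly like this (a sketch only, reusing
the test's existing $node and injection point; the log-offset handling is an
assumption, not the actual proposed patch):

# Remember where the server log currently ends, so we only match new output.
my $logstart = -s $node->logfile;

# Continue the checkpoint.
$node->safe_psql('postgres',
	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

# Wait for the checkpointer to report completion before shutting down.
$node->wait_for_log(qr/checkpoint complete/, $logstart);

# Only now issue the immediate (crash-like) shutdown.
$node->stop('immediate');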
--
Michael
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 10:24:30
Message-ID: CAPpHfdsO9s5he3xHWNFtwvXtvsftu3nNz=PV9fdN32UOh-Z3tA@mail.gmail.com
Lists: pgsql-committers
On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > Yeah, that's what I think too. The unintentional omission of a
> > pre-shutdown delay in the 046 test has exposed some pre-existing
> > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > of changing 046 till we understand this better.
>
> Yes, better to understand what's going on before changing the test,
> and perhaps change 046 so as it provide stable coverage for this case,
> even if discovered accidentally.
>
> So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> this context, because XLogDecodeNextRecord() is trying to read pages
> as it has a logic too permissive? How did any of you reproduce the
> failure? Just by running the test in a loop? It is one of these
> patterns where a hardcoded sleep should do the trick and help with a
> bisect.
Personally, I just run the test in a loop. It takes about 30 iterations to
reproduce. Works both with -O0 and -O2.
cd src/test/recovery
while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
> By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> you expect the checkpoint to complete before sending the immediate
> shutdown request, I would suggest to use a wait_for_log() based on
> "checkpoint complete" or an equivalent loop. What you are doing in
> the test is unstable as written.
Exactly. I've proposed a fix with wait_for_log() in [1]. Nevertheless,
both cases (immediate stop before checkpoint completion, and immediate stop
after checkpoint completion) must work without hanging.
------
Regards,
Alexander Korotkov
Supabase
From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 11:59:47
Message-ID: CAGECzQTx+UHMfVPgKRL0KqyeKgRoXFAp2LJbcDRTZOhduM-NKg@mail.gmail.com
Lists: pgsql-committers
On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > Yeah, that's what I think too. The unintentional omission of a
> > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > > of changing 046 till we understand this better.
> >
> > Yes, better to understand what's going on before changing the test,
> > and perhaps change 046 so as it provide stable coverage for this case,
> > even if discovered accidentally.
> >
> > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > this context, because XLogDecodeNextRecord() is trying to read pages
> > as it has a logic too permissive? How did any of you reproduce the
> > failure? Just by running the test in a loop? It is one of these
> > patterns where a hardcoded sleep should do the trick and help with a
> > bisect.
>
> Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2.
>
> cd src/test/recovery
> while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
>
> > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > you expect the checkpoint to complete before sending the immediate
> > shutdown request, I would suggest to use a wait_for_log() based on
> > "checkpoint complete" or an equivalent loop. What you are doing in
> > the test is unstable as written.
>
> Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang.
CFBot has been much more red than usual since this change, afaict. Many
more Windows builds are failing than usual with an error like this:
[08:28:52.683] 338/338 postgresql:recovery / recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1
How about we revert the commit for now to get CI and the buildfarm green again?
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 12:31:25
Message-ID: CAPpHfdtnkGPMSrTaWs3vGP-FYKOXhUZq3OHDeaZwHXsAQ2g=WA@mail.gmail.com
Lists: pgsql-committers
On Mon, Jun 23, 2025 at 3:00 PM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > > Yeah, that's what I think too. The unintentional omission of a
> > > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > > > of changing 046 till we understand this better.
> > >
> > > Yes, better to understand what's going on before changing the test,
> > > and perhaps change 046 so as it provide stable coverage for this case,
> > > even if discovered accidentally.
> > >
> > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > > this context, because XLogDecodeNextRecord() is trying to read pages
> > > as it has a logic too permissive? How did any of you reproduce the
> > > failure? Just by running the test in a loop? It is one of these
> > > patterns where a hardcoded sleep should do the trick and help with a
> > > bisect.
> >
> > Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2.
> >
> > cd src/test/recovery
> > while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
> >
> > > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > > you expect the checkpoint to complete before sending the immediate
> > > shutdown request, I would suggest to use a wait_for_log() based on
> > > "checkpoint complete" or an equivalent loop. What you are doing in
> > > the test is unstable as written.
> >
> > Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang.
>
> CFBot has been much more red than usual since this change afaict. Many
> more windows builds are failing than usual with an error like this:
>
> [08:28:52.683] 338/338 postgresql:recovery /
> recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1
>
> How about we revert the commit for now to get CI and the buildfarm green again?
What about removing the 046_checkpoint_logical_slot which currently
causes all the buzz? I'm not yet convinced we need to revert
ca307d5cec90. Opinions?
------
Regards,
Alexander Korotkov
Supabase
From: vignesh C <vignesh21(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 12:53:25
Message-ID: CALDaNm1t0PZUjvAsJ3jRK_S49OmmLhqWjQC+jdidxBmZTj1FXg@mail.gmail.com
Lists: pgsql-committers
On Sun, 22 Jun 2025 at 05:46, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Sat, Jun 21, 2025 at 2:42 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > > And I see the following variable values.
> >
> > > (lldb) p/x targetPagePtr
> > > (XLogRecPtr) 0x0000000029004000
> > > (lldb) p/x RecPtr
> > > (XLogRecPtr) 0x0000000029002138
> >
> > > I hardly understand how is this possible given it was compiled with "-O0".
> > > I'm planning to continue investigating this tomorrow.
> >
> > Yeah, I see
> >
> > (lldb) p/x targetPagePtr
> > (XLogRecPtr) 0x0000000029004000
> > (lldb) p/x RecPtr
> > (XLogRecPtr) 0x0000000029002138
> > (lldb) p/x RecPtr - (RecPtr % 8192)
> > (XLogRecPtr) 0x0000000029002000
> >
> > We're here:
> >
> > /* Calculate pointer to beginning of next page */
> > targetPagePtr += XLOG_BLCKSZ;
> >
> > /* Wait for the next page to become available */
> > readOff = ReadPageInternal(state, targetPagePtr,
> > Min(total_len - gotlen + SizeOfXLogShortPHD,
> > XLOG_BLCKSZ));
> >
> > so that's where the increment of targetPagePtr came from.
> > But "Wait for the next page to become available" seems awfully
> > trusting that there will be another page. Should this be
> > using the no-wait code path?
>
> Thank you for the help. It seems to me that problem is deeper. The
> code seems to only trying to read till the end of given WAL record,
> but can't reach it. According to the values I've seen in XLogCtl, it
> seems that RedoRecPtr points somewhere inside of that record's body.
> I don't feel confident about to understand what's going on and how to
> fix it.
>
> I've tried two things.
> 1) slot_tests_wait_for_checkpoint.patch
> Make tests wait for checkpoint completion (as I think they were
> originally intended). However, the problem still persists.
> 2) revert_slot_last_saved_restart_lsn.patch
> Revert ca307d5cec90 and make new tests reserve WAL using wal_keep_size
> GUC. The problem still persists. It seems to be some problem
> independent to my attempts to fix retaining WAL files with slot's
> restart_lsn. The new tests just spotted the existing bug.
I was able to reproduce this issue after a few runs. At the time of
failure, the last few WAL records were as follows:
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/290020C0, prev 0/29002088, desc: CHECKPOINT_ONLINE redo 0/29002068; tli 1; prev tli 1; fpw true; wal_level logical; xid 0:753; oid 24576; multi 1; offset 0; oldest xid 744 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 753; online
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 0/29004018, prev 0/290020C0, desc: OVERWRITE_CONTRECORD lsn 0/29002138; time 2025-06-23 11:28:22.748866 IST
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/29004048, prev 0/29004018, desc: CHECKPOINT_SHUTDOWN redo 0/29004048; tli 1; prev tli 1; fpw true; wal_level logical; xid 0:753; oid 24576; multi 1; offset 0; oldest xid 744 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/290040C0, prev 0/29004048, desc: RUNNING_XACTS nextXid 753 latestCompletedXid 752 oldestRunningXid 753
When trying to read the CHECKPOINT_ONLINE record at LSN 0/290020C0,
the system determines that the total record length is 8133 bytes; these
values can be obtained from the XLogDecodeNextRecord function:
(gdb) p total_len
$111 = 8133
Based on this, it expects to read 277 bytes from the next page:
(gdb) p gotlen
$112 = 7880 # Data available in the current page
(gdb) p SizeOfXLogShortPHD
$113 = 24
(gdb) p total_len - gotlen + SizeOfXLogShortPHD
$115 = 277 # Data expected from the next page
However, in the read_local_xlog_page_guts function, it finds that WAL
data is available only up to 0x290040F8, whereas it needs data up to
0x29004115. As a result, it continuously waits for more WAL data to
appear.
One thing I'm not sure about is why the CHECKPOINT_ONLINE record is
reported as requiring 8133 bytes. This seems unusually large for a
checkpoint record, and I plan to investigate this further. Thoughts?
Another key point is the presence of an OVERWRITE_CONTRECORD in the
next page. This indicates that the record at 0/29002138 should be
skipped, because only a partial record was written. However, this skip is
currently not happening, leading to the wait.
This appears to be a defect in how OVERWRITE_CONTRECORD is handled:
the reader fails to recognize that an OVERWRITE_CONTRECORD has been written,
which indicates the current record is no longer valid and should be
skipped. As a result, it continues waiting for WAL data based on an
invalid record.
Regards,
Vignesh
From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 13:33:14
Message-ID: CAA4eK1KY+hDad+VfS5A3YJm7N1pNAKxjNiiWrJNG3q0kSqM9LQ@mail.gmail.com
Lists: pgsql-committers
On Mon, Jun 23, 2025 at 6:01 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Mon, Jun 23, 2025 at 3:00 PM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > > On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > > > Yeah, that's what I think too. The unintentional omission of a
> > > > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > > > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > > > > of changing 046 till we understand this better.
> > > >
> > > > Yes, better to understand what's going on before changing the test,
> > > > and perhaps change 046 so as it provide stable coverage for this case,
> > > > even if discovered accidentally.
> > > >
> > > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > > > this context, because XLogDecodeNextRecord() is trying to read pages
> > > > as it has a logic too permissive? How did any of you reproduce the
> > > > failure? Just by running the test in a loop? It is one of these
> > > > patterns where a hardcoded sleep should do the trick and help with a
> > > > bisect.
> > >
> > > Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2.
> > >
> > > cd src/test/recovery
> > > while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
> > >
> > > > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > > > you expect the checkpoint to complete before sending the immediate
> > > > shutdown request, I would suggest to use a wait_for_log() based on
> > > > "checkpoint complete" or an equivalent loop. What you are doing in
> > > > the test is unstable as written.
> > >
> > > Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang.
> >
> > CFBot has been much more red than usual since this change afaict. Many
> > more windows builds are failing than usual with an error like this:
> >
> > [08:28:52.683] 338/338 postgresql:recovery /
> > recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1
> >
> > How about we revert the commit for now to get CI and the buildfarm green again?
>
> What about removing the 046_checkpoint_logical_slot which currently
> causes all the buzz? I'm not yet convinced we need to revert
> ca307d5cec90. Opinions?
>
If we decide to revert/remove anything, it is better to remove
046_checkpoint_logical_slot and keep investigating the issue. As per
the information available at this point, it appears to be a base code
bug accidentally discovered by this test case. OTOH, removing this
test has a risk that there could be a delay in finding the root cause
of the issue.
--
With Regards,
Amit Kapila.
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 18:50:15
Message-ID: CAPpHfdtugE8Z4ponftUwmL7tv+e4Z_jSUUv-v8-TU_VPG-baEA@mail.gmail.com
Lists: pgsql-committers
On Mon, Jun 23, 2025 at 4:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Mon, Jun 23, 2025 at 6:01 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> >
> > On Mon, Jun 23, 2025 at 3:00 PM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > > On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > > > On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > > > > Yeah, that's what I think too. The unintentional omission of a
> > > > > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > > > > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > > > > > of changing 046 till we understand this better.
> > > > >
> > > > > Yes, better to understand what's going on before changing the test,
> > > > > and perhaps change 046 so as it provide stable coverage for this case,
> > > > > even if discovered accidentally.
> > > > >
> > > > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > > > > this context, because XLogDecodeNextRecord() is trying to read pages
> > > > > as it has a logic too permissive? How did any of you reproduce the
> > > > > failure? Just by running the test in a loop? It is one of these
> > > > > patterns where a hardcoded sleep should do the trick and help with a
> > > > > bisect.
> > > >
> > > > Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2.
> > > >
> > > > cd src/test/recovery
> > > > while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
> > > >
> > > > > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > > > > you expect the checkpoint to complete before sending the immediate
> > > > > shutdown request, I would suggest to use a wait_for_log() based on
> > > > > "checkpoint complete" or an equivalent loop. What you are doing in
> > > > > the test is unstable as written.
> > > >
> > > > Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang.
> > >
> > > CFBot has been much more red than usual since this change afaict. Many
> > > more windows builds are failing than usual with an error like this:
> > >
> > > [08:28:52.683] 338/338 postgresql:recovery /
> > > recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1
> > >
> > > How about we revert the commit for now to get CI and the buildfarm green again?
> >
> > What about removing the 046_checkpoint_logical_slot which currently
> > causes all the buzz? I'm not yet convinced we need to revert
> > ca307d5cec90. Opinions?
> >
>
> If we decide to revert/remove anything, it is better to remove
> 046_checkpoint_logical_slot and keep investigating the issue. As per
> the information available at this point, it appears to be a base code
> bug accidentally discovered by this test case. OTOH, removing this
> test has a risk that there could be a delay in finding the root cause
> of the issue.
I decided to remove the test while we're investigating the issue. It
might take a bit longer for us to fix, but that way it won't disrupt
others' work.
------
Regards,
Alexander Korotkov
Supabase
From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 18:58:57
Message-ID: CAGECzQTAYJVEFYK7kithNSouNFxf-gmZFCbyKpYn+LH43PHTvA@mail.gmail.com
Lists: pgsql-committers
On Mon, 23 Jun 2025 at 20:50, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> I decided to remove the test while we're investigating the issue. It
> might take a bit longer for us to fix, but that wouldn't distort
> others' work.
Sounds good. I reset the backoff of all jobs in the CFBot database, so
that the commitfest app becomes green a bit quicker.
From: vignesh C <vignesh21(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-25 16:52:02
Message-ID: CALDaNm1avMmyUCryYHYRjXDFXaxg7J-EknmqXkLXbPiLW9Fi3A@mail.gmail.com
Lists: pgsql-committers
On Tue, 24 Jun 2025 at 00:20, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Mon, Jun 23, 2025 at 4:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Mon, Jun 23, 2025 at 6:01 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jun 23, 2025 at 3:00 PM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > > > On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > > > > On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > > > > > Yeah, that's what I think too. The unintentional omission of a
> > > > > > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > > > > > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > > > > > > of changing 046 till we understand this better.
> > > > > >
> > > > > > Yes, better to understand what's going on before changing the test,
> > > > > > and perhaps change 046 so as it provide stable coverage for this case,
> > > > > > even if discovered accidentally.
> > > > > >
> > > > > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > > > > > this context, because XLogDecodeNextRecord() is trying to read pages
> > > > > > as it has a logic too permissive? How did any of you reproduce the
> > > > > > failure? Just by running the test in a loop? It is one of these
> > > > > > patterns where a hardcoded sleep should do the trick and help with a
> > > > > > bisect.
> > > > >
> > > > > Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2.
> > > > >
> > > > > cd src/test/recovery
> > > > > while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
> > > > >
> > > > > > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > > > > > you expect the checkpoint to complete before sending the immediate
> > > > > > shutdown request, I would suggest to use a wait_for_log() based on
> > > > > > "checkpoint complete" or an equivalent loop. What you are doing in
> > > > > > the test is unstable as written.
> > > > >
> > > > > Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang.
> > > >
> > > > CFBot has been much more red than usual since this change afaict. Many
> > > > more windows builds are failing than usual with an error like this:
> > > >
> > > > [08:28:52.683] 338/338 postgresql:recovery /
> > > > recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1
> > > >
> > > > How about we revert the commit for now to get CI and the buildfarm green again?
> > >
> > > What about removing the 046_checkpoint_logical_slot which currently
> > > causes all the buzz? I'm not yet convinced we need to revert
> > > ca307d5cec90. Opinions?
> > >
> >
> > If we decide to revert/remove anything, it is better to remove
> > 046_checkpoint_logical_slot and keep investigating the issue. As per
> > the information available at this point, it appears to be a base code
> > bug accidentally discovered by this test case. OTOH, removing this
> > test has a risk that there could be a delay in finding the root cause
> > of the issue.
>
> I decided to remove the test while we're investigating the issue. It
> might take a bit longer for us to fix, but that wouldn't distort
> others' work.
I have analyzed this issue further and started a new thread for this at [1].
[1] - https://www.postgresql.org/message-id/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com
Regards,
Vignesh