Re: pgsql-server/ /configure /configure.in rc/incl ...

From:	tgl(at)postgresql(dot)org (Tom Lane)
To:	pgsql-committers(at)postgresql(dot)org
Subject:	pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 03:16:56
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

CVSROOT: /cvsroot
Module name: pgsql-server
Changes by: tgl(at)postgresql(dot)org 03/03/05 22:16:56

Modified files:
. : configure configure.in
src/include : pg_config.h.in
src/interfaces/libpq: fe-misc.c

Log message:
Use poll(2) in preference to select(2), if available. This solves
problems in applications that may have a large number of files open,
such that libpq's socket number exceeds the range supported by fd_set.
From Chris Brown.
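
[Editorial sketch, not the committed fe-misc.c code.] The log message above turns on one property of poll(2): it takes an array of descriptors rather than a fixed-size bitmap, so a socket whose number exceeds FD_SETSIZE can still be waited on. A minimal illustration, assuming a hypothetical helper named wait_readable():

```c
#include <errno.h>
#include <poll.h>

/*
 * Wait until fd is readable (or has an error/hangup pending).
 * timeout_ms follows poll(2): 0 probes the current status and returns
 * at once, -1 blocks indefinitely.
 * Returns 1 if an event is pending, 0 on timeout, -1 on error.
 * wait_readable is a hypothetical name, not a libpq function.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;            /* any valid descriptor; no FD_SETSIZE ceiling */
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        /* Simplification: a retry after EINTR restarts the full timeout. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}
```

With select(2) the same wait would need FD_SET(fd, &set), which is undefined once fd is FD_SETSIZE or larger.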

From:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To:	<pgsql-committers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)postgresql(dot)org>
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 03:30:11
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

Has anyone ever thought about adding kqueue (for *BSD) support to Postgres,
instead of using select?

LIBRARY
Standard C Library (libc, -lc)

SYNOPSIS
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

int
kqueue(void);

int
kevent(int kq, const struct kevent *changelist, int nchanges,
struct kevent *eventlist, int nevents,
const struct timespec *timeout);

EV_SET(&kev, ident, filter, flags, fflags, data, udata);

DESCRIPTION
kqueue() provides a generic method of notifying the user when an event
happens or a condition holds, based on the results of small pieces of
kernel code termed filters. A kevent is identified by the (ident,
filter) pair; there may only be one unique kevent per kqueue.

The filter is executed upon the initial registration of a kevent in
order to detect whether a preexisting condition is present, and is also
executed whenever an event is passed to the filter for evaluation. If
the filter determines that the condition should be reported, then the
kevent is placed on the kqueue for the user to retrieve.

The filter is also run when the user attempts to retrieve the kevent
from the kqueue. If the filter indicates that the condition that
triggered the event no longer holds, the kevent is removed from the
kqueue and is not returned.

Chris

> CVSROOT: /cvsroot
> Module name: pgsql-server
> Changes by: tgl(at)postgresql(dot)org 03/03/05 22:16:56
>
> Modified files:
> . : configure configure.in
> src/include : pg_config.h.in
> src/interfaces/libpq: fe-misc.c
>
> Log message:
> Use poll(2) in preference to select(2), if available. This solves
> problems in applications that may have a large number of files open,
> such that libpq's socket number exceeds the range supported by fd_set.
> From Chris Brown.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 03:34:27
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au> writes:
> Has anyone ever thought about adding kqueue (for *BSD) support to Postgres,
> instead of using select?

Why? poll() is standard. kqueue isn't, AFAIK.

regards, tom lane

From:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	<pgsql-committers(at)postgresql(dot)org>
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 03:42:42
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au> writes:
> > Has anyone ever thought about adding kqueue (for *BSD) support to
> > Postgres, instead of using select?
>
> Why? poll() is standard. kqueue isn't, AFAIK.

It's supposed to be a whole heap faster - there is no polling involved...

Chris

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 03:47:51
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au> writes:
>>> Has anyone ever thought about adding kqueue (for *BSD) support to
>>> Postgres, instead of using select?
>>
>> Why? poll() is standard. kqueue isn't, AFAIK.

> It's supposed to be a whole heap faster - there is no polling involved...

Supposed by whom? Faster than what? And how would it not poll?

The way libpq uses this call, it's either probing for current status
(timeout=0) or it's willing to block, possibly indefinitely, until the
desired condition arises. It does not sit there in a busy-wait loop.
I can't see any reason to think that an OS-specific API would give
any marked difference in performance.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 04:19:16
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au> writes:
> It's supposed to be a whole heap faster - there is no polling involved...

I looked into this more. AFAICT, the scenario in which kqueue is
said to be faster involves watching a large number of file
descriptors simultaneously. Since libpq is only watching one
descriptor, I don't see the benefit of adopting kqueue ...

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 04:33:36
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

I assume he just assumed poll() actually polls. It doesn't. It is just
like select().

---------------------------------------------------------------------------

Tom Lane wrote:
> "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au> writes:
> >>> Has anyone ever thought about adding kqueue (for *BSD) support to
> >>> Postgres, instead of using select?
> >>
> >> Why? poll() is standard. kqueue isn't, AFAIK.
>
> > It's supposed to be a whole heap faster - there is no polling involved...
>
> Supposed by whom? Faster than what? And how would it not poll?
>
> The way libpq uses this call, it's either probing for current status
> (timeout=0) or it's willing to block, possibly indefinitely, until the
> desired condition arises. It does not sit there in a busy-wait loop.
> I can't see any reason to think that an OS-specific API would give
> any marked difference in performance.
>
> regards, tom lane

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 09:41:17
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> >>> Has anyone ever thought about adding kqueue (for *BSD) support to
> >>> Postgres, instead of using select?
> >>
> >> Why? poll() is standard. kqueue isn't, AFAIK.
>
> > It's supposed to be a whole heap faster - there is no polling involved...
>
> Supposed by whom? Faster than what? And how would it not poll?
>
> The way libpq uses this call, it's either probing for current status
> (timeout=0) or it's willing to block, possibly indefinitely, until the
> desired condition arises. It does not sit there in a busy-wait loop.
> I can't see any reason to think that an OS-specific API would give
> any marked difference in performance.

Heh, kqueue is _the_ reason to use FreeBSD.

http://www.kegel.com/dkftpbench/Poller_bench.html#results

I've toyed with the idea of adding this because it is monstrously more
efficient than select()/poll() in basically every way, shape, and
form.

That said, in terms of performance perks, I'd think migrating the
backend to using mmap() would yield a bigger performance benefit (see
Stevens) to a larger group of people than adding FreeBSD's kqueue
interface (something I plan on doing at some point if no one beats me
to it). mmap() + write() for FreeBSD is a zero-copy socket operation
and likely is on other platforms. Reducing the number of pages that
have to be copied around would be a big win in terms of sending data
to clients as well as scanning through data. Files are also only
mmap()'ed in the kernel once with BSD's VM system which could reduce
the RAM consumed by backends considerably.

mmap() would also be an interesting way of providing some kind of
atomicity for MVCC (re: WAL, use msync() to have the mapped region hit
the disk before the change). I was actually quite surprised when I
grep'ed through the code and found that mmap() wasn't in use
_anywhere_. The TODO seems to be full of messages, but not much in
the way of authoritative statements. Is this one of the areas of
PostgreSQL that just needs to get slowly migrated to use mmap() or are
there any gaping reasons why to not use the family of system calls?

-sc

--
Sean Chittenden

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Sean Chittenden <sean(at)chittenden(dot)org>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-06 15:25:36
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

Sean Chittenden <sean(at)chittenden(dot)org> writes:
> I've toyed with the idea of adding this because it is monstrously more
> efficient than select()/poll() in basically every way, shape, and
> form.

From what I've looked at, kqueue only wins when you are watching a large
number of file descriptors at the same time; which is an operation done
nowhere in Postgres. I think the above would be a complete waste of
effort.

> Is this one of the areas of
> PostgreSQL that just needs to get slowly migrated to use mmap() or are
> there any gaping reasons why to not use the family of system calls?

There has been much speculation on this, and no proof that it actually
buys us anything to justify the portability hit. There would be some
nontrivial problems to solve, such as the mechanics of accessing a
large number of files from a large number of backends without running
out of virtual memory. Also, is it guaranteed that multiple backends
mmap'ing the same block will access the very same physical buffer, and
not multiple copies? Multiple copies would be fatal. See the archives
for more discussion.

regards, tom lane

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-committers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-07 00:36:40
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

[moving to -performance, please drop -committers from replies]

> > I've toyed with the idea of adding this because it is monstrously more
> > efficient than select()/poll() in basically every way, shape, and
> > form.
>
> From what I've looked at, kqueue only wins when you are watching a
> large number of file descriptors at the same time; which is an
> operation done nowhere in Postgres. I think the above would be a
> complete waste of effort.

It scales very well to many thousands of descriptors, but it also
works well on small numbers as well. kqueue is about 5x faster than
select() or poll() on the low end of number of fd's. As I said
earlier, I don't think there is _much_ to gain in this regard, but I
do think that it would be a speed improvement but only to one OS
supported by PostgreSQL. I think that there are bigger speed
improvements to be had elsewhere in the code.

> > Is this one of the areas of PostgreSQL that just needs to get
> > slowly migrated to use mmap() or are there any gaping reasons why
> > to not use the family of system calls?
>
> There has been much speculation on this, and no proof that it
> actually buys us anything to justify the portability hit.

Actually, I think that it wouldn't be that big of a portability hit
because you still would read() and write() as always, but in
performance sensitive areas, an #ifdef HAVE_MMAP section would have
the appropriate mmap() calls. If the system doesn't have mmap(),
there isn't much to lose and we're in the same position we're in now.
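
[Editorial sketch.] The #ifdef HAVE_MMAP arrangement described above could look roughly like the following: one entry point, an mmap() path when the build defines HAVE_MMAP, and the plain read() loop otherwise. The function name sum_file_bytes and the byte-summing workload are assumptions made for illustration; nothing here is PostgreSQL code.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#ifdef HAVE_MMAP
#include <sys/mman.h>
#endif

/*
 * Add up every byte of the file at `path` and store the result in *sum.
 * Returns 0 on success, -1 on error. (Hypothetical helper.)
 */
int sum_file_bytes(const char *path, unsigned long *sum)
{
    struct stat st;
    unsigned long total = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

#ifdef HAVE_MMAP
    /* mmap path: scan the file in place, no copy into a user buffer. */
    if (st.st_size > 0) {
        size_t len = (size_t) st.st_size;
        const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        for (size_t i = 0; i < len; i++)
            total += p[i];
        munmap((void *) p, len);
    }
#else
    /* Portable fallback: the usual read() loop. */
    {
        unsigned char buf[8192];
        ssize_t n;

        while ((n = read(fd, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; i++)
                total += buf[i];
        if (n < 0) {
            close(fd);
            return -1;
        }
    }
#endif

    close(fd);
    *sum = total;
    return 0;
}
```

Callers see the same interface either way, which is the portability argument being made; whether the mmap path is actually faster is the open question in this thread.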

> There would be some nontrivial problems to solve, such as the
> mechanics of accessing a large number of files from a large number
> of backends without running out of virtual memory. Also, is it
> guaranteed that multiple backends mmap'ing the same block will
> access the very same physical buffer, and not multiple copies?
> Multiple copies would be fatal. See the archives for more
> discussion.

Have read through the archives. Making a call to madvise() will speed
up access to the pages as it gives hints to the VM about what order
the pages are accessed/used. Here are a few bits from the BSD mmap()
and madvise() man pages:

mmap(2):
MAP_NOSYNC  Causes data dirtied via this VM map to be flushed to
            physical media only when necessary (usually by the
            pager) rather than gratuitously. Typically this
            prevents the update daemons from flushing pages dirtied
            through such maps and thus allows efficient sharing of
            memory across unassociated processes using a
            file-backed shared memory map. Without this option any
            VM pages you dirty may be flushed to disk every so often
            (every 30-60 seconds usually) which can create
            performance problems if you do not need that to occur
            (such as when you are using shared file-backed mmap
            regions for IPC purposes). Note that VM/filesystem
            coherency is maintained whether you use MAP_NOSYNC or
            not. This option is not portable across UNIX platforms
            (yet), though some may implement the same behavior by
            default.

            WARNING! Extending a file with ftruncate(2), thus
            creating a big hole, and then filling the hole by
            modifying a shared mmap() can lead to severe file
            fragmentation. In order to avoid such fragmentation you
            should always pre-allocate the file's backing store by
            write()ing zeros into the newly extended area prior to
            modifying the area via your mmap(). The fragmentation
            problem is especially sensitive to MAP_NOSYNC pages,
            because pages may be flushed to disk in a totally
            random order.

            The same applies when using MAP_NOSYNC to implement a
            file-based shared memory store. It is recommended that
            you create the backing store by write()ing zeros to
            the backing file rather than ftruncate()ing it. You
            can test file fragmentation by observing the KB/t
            (kilobytes per transfer) results from an ``iostat 1''
            while reading a large file sequentially, e.g. using
            ``dd if=filename of=/dev/null bs=32k''.

            The fsync(2) function will flush all dirty data and
            metadata associated with a file, including dirty NOSYNC
            VM data, to physical media. The sync(8) command and
            sync(2) system call generally do not flush dirty NOSYNC
            VM data. The msync(2) system call is obsolete since
            BSD implements a coherent filesystem buffer cache.
            However, it may be used to associate dirty VM pages
            with filesystem buffers and thus cause them to be
            flushed to physical media sooner rather than later.

madvise(2):
MADV_NORMAL      Tells the system to revert to the default paging
                 behavior.

MADV_RANDOM      Is a hint that pages will be accessed randomly, and
                 prefetching is likely not advantageous.

MADV_SEQUENTIAL  Causes the VM system to depress the priority of pages
                 immediately preceding a given page when it is faulted
                 in.

mprotect(2):
The mprotect() system call changes the specified pages to have protection
prot. Not all implementations will guarantee protection on a page basis;
the granularity of protection changes may be as large as an entire
region. A region is the virtual address space defined by the start and
end addresses of a struct vm_map_entry.

Currently these protection bits are known, which can be combined, OR'd
together:

PROT_NONE No permissions at all.

PROT_READ The pages can be read.

PROT_WRITE The pages can be written.

PROT_EXEC The pages can be executed.

msync(2):
The msync() system call writes any modified pages back to the filesystem
and updates the file modification time. If len is 0, all modified pages
within the region containing addr will be flushed; if len is non-zero,
only those pages containing addr and len-1 succeeding locations will be
examined. The flags argument may be specified as follows:

MS_ASYNC Return immediately
MS_SYNC Perform synchronous writes
MS_INVALIDATE Invalidate all cached data
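
[Editorial sketch.] The man page excerpts above describe a sequence: pre-allocate the backing store by write()ing zeros instead of ftruncate()ing a hole, map the file shared, modify it through the mapping, and flush with msync(). A minimal version of that sequence, assuming a hypothetical helper store_via_mmap(); MAP_NOSYNC is only requested where the platform defines it:

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Create `path` with `len` bytes of zero-filled backing store, map it
 * shared, copy `msg` to the start of the mapping, and flush it
 * synchronously. Returns 0 on success, -1 on error.
 * store_via_mmap is a hypothetical name used for illustration.
 */
int store_via_mmap(const char *path, size_t len, const char *msg)
{
    char zeros[4096];
    size_t done = 0;
    size_t msglen = strlen(msg);
    int flags = MAP_SHARED;
    char *p;
    int fd;

    if (len == 0 || msglen > len)
        return -1;
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Pre-allocate by writing zeros, as the mmap(2) WARNING recommends. */
    memset(zeros, 0, sizeof zeros);
    while (done < len) {
        size_t chunk = len - done < sizeof zeros ? len - done : sizeof zeros;
        ssize_t n = write(fd, zeros, chunk);

        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t) n;
    }

#ifdef MAP_NOSYNC
    flags |= MAP_NOSYNC;    /* FreeBSD-specific; skipped elsewhere */
#endif
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(p, msg, msglen);             /* dirty the mapped pages */
    if (msync(p, len, MS_SYNC) < 0) {   /* synchronous write-back */
        munmap(p, len);
        close(fd);
        return -1;
    }

    munmap(p, len);
    close(fd);
    return 0;
}
```

For example, store_via_mmap("/tmp/demo.dat", 8192, "hello") would leave an 8192-byte file whose first five bytes are "hello" (the path is a placeholder).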

A few thoughts come to mind:

1) backends could share buffers by mmap()'ing shared regions of data.
While I haven't seen any numbers to reflect this, I'd wager that
mmap() is a faster interface than ipc.

2) It looks like while there are various file IO schemes scattered all
over the place, the bulk of the critical routines that would need
to be updated are in backend/storage/file/fd.c, more specifically:

*) fileNameOpenFile() would need the appropriate mmap() call made
to it.

*) FileTruncate() would need some attention to avoid fragmentation.

*) a new "sync" GUC would have to be introduced to handle msync
(affects only pg_fsync() and pg_fdatasync()).

3) There's a bit of code in pgsql/src/backend/storage/smgr that could
be gutted/removed. Which of those storage types are even used any
more? There's a reference in the code to PostgreSQL 3.0. :)
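
[Editorial sketch.] Point 1 above, and Tom's earlier question about whether two backends mapping the same block see the same data, can be probed with a small experiment: two processes map the same file separately with MAP_SHARED, one stores a byte, the other reads it back through its own mapping. The helper name shared_map_roundtrip is hypothetical. POSIX specifies that stores through a MAP_SHARED mapping change the underlying object; this test only shows what one kernel does, so it does not settle the guarantee for every platform PostgreSQL supports.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * The parent maps the first byte of `path` read-only. A child process
 * then opens and maps the file on its own and stores `value` there.
 * Returns the byte the parent sees through its mapping afterwards,
 * or -1 on error. The file must already be at least one byte long.
 */
int shared_map_roundtrip(const char *path, unsigned char value)
{
    int fd = open(path, O_RDWR);
    unsigned char *mine;
    pid_t pid;
    int status, seen;

    if (fd < 0)
        return -1;
    mine = mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    if (mine == MAP_FAILED)
        return -1;

    pid = fork();
    if (pid == 0) {
        /* Child: a second, independent open() and mmap() of the file. */
        int cfd = open(path, O_RDWR);
        unsigned char *theirs;

        if (cfd < 0)
            _exit(1);
        theirs = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, cfd, 0);
        if (theirs == MAP_FAILED)
            _exit(1);
        theirs[0] = value;      /* store through the child's mapping */
        _exit(0);
    }
    if (pid < 0 || waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        munmap(mine, 1);
        return -1;
    }

    seen = mine[0];             /* read through the parent's mapping */
    munmap(mine, 1);
    return seen;
}
```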

And I think that'd be it. The LRU code could be used if necessary to
help manage the amount of data mmap()'ed in the VM at any one time; at
the very least that could be handled by a shm var that various backends
would increment/decrement as files are open()'ed/close()'ed.

I didn't spend too long looking at this, but I _think_ that'd cover
80% of PostgreSQL's disk access needs. The next bit to possibly add
would be passing a flag on FileOpen operations that'd act as a hint to
madvise() that way the VM could proactively react to PostgreSQL's
needs.

I don't have my copy of Stevens handy (it's some 700mi away atm
otherwise I'd cite it), but if Tom or someone else has it handy, look
up the example re: the performance gain from read()'ing an mmap()'ed
file versus a non-mmap()'ed file. The difference is non-trivial and
_WELL_ worth the time given the speed increase. The same speed
benefit held true for writes as well, iirc. It's been a while, but I
think it was around page 330. The index has it listed and it's not
that hard of an example to find. -sc

--
Sean Chittenden

From:	Neil Conway <neilc(at)samurai(dot)com>
To:	Sean Chittenden <sean(at)chittenden(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-07 00:47:52
Message-ID:	1046998072.10527.67.camel@tokyo
Lists:	pgsql-committers pgsql-performance

On Thu, 2003-03-06 at 19:36, Sean Chittenden wrote:
> I don't have my copy of Stevens handy (it's some 700mi away atm
> otherwise I'd cite it), but if Tom or someone else has it handy, look
> up the example re: the performance gain from read()'ing an mmap()'ed
> file versus a non-mmap()'ed file. The difference is non-trivial and
> _WELL_ worth the time given the speed increase.

Can anyone confirm this? If so, one easy step we could take in this
direction would be adapting COPY FROM to use mmap().

Cheers,

Neil

--
Neil Conway <neilc(at)samurai(dot)com> || PGP Key ID: DB3C29FC

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Neil Conway <neilc(at)samurai(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-07 06:04:12
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> > I don't have my copy of Stevens handy (it's some 700mi away atm
> > otherwise I'd cite it), but if Tom or someone else has it handy, look
> > up the example re: the performance gain from read()'ing an mmap()'ed
> > file versus a non-mmap()'ed file. The difference is non-trivial and
> > _WELL_ worth the time given the speed increase.
>
> Can anyone confirm this? If so, one easy step we could take in this
> direction would be adapting COPY FROM to use mmap().

Weeee! Alright, so I got to have some fun writing out some simple
tests with mmap() and friends tonight. Are the results interesting?
Absolutely! Is this a simple benchmark? Yup. Do I think it
simulates PostgreSQL? Eh, not particularly. Does it demonstrate that
mmap() is a win and something worth implementing? I sure hope so. Is
this a test program to demonstrate the ideal use of mmap() in
PostgreSQL? No. Is it a place to start a factual discussion? I hope
so.

I have here four tests that are conditionalized by cpp.

# The first one uses read() and write() but with the buffer size set
# to the same size as the file.
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047013002.412516
Time: 82.88178

Completed tests
82.09 real 2.13 user 68.98 sys

# The second one uses read() and write() with the default buffer size:
# 65536
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is default read size: 65536
Number of iterations: 100000
Start time: 1047013085.16204
Time: 18.155511

Completed tests
18.16 real 0.90 user 14.79 sys
# Please note this is significantly faster, but that's expected

# The third test uses mmap() + madvise() + write()
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047013103.859818
Time: 8.4294203644

Completed tests
7.24 real 0.41 user 5.92 sys
# Faster still, and twice as fast as the normal read() case

# The last test only calls mmap() once when the file is opened and
# only msync()s, munmap()s, and close()s the file once at exit.
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -DDO_MMAP_ONCE=1 -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047013111.623712
Time: 1.174076

Completed tests
1.18 real 0.09 user 0.92 sys
# Substantially faster

Obviously this isn't perfect, but reading and writing data is faster
(specifically moving pages through the VM/OS). Doing partial writes
from mmap()'ed data should be faster along with scanning through
mmap()'ed portions of - or completely mmap()'ed - files because the
pages are already loaded in the VM. PostgreSQL's LRU file descriptor
cache could easily be adjusted to add mmap()'ing of frequently
accessed files (specifically, system catalogs come to mind). It's not
hard to figure out how often particular files are accessed and to
either _avoid_ mmap()'ing a file that isn't accessed often, or to
mmap() files that _are_ accessed often. mmap() does have a cost, but
I'd wager that mmap()'ing the same file a second or third time from a
different process would be more efficient. The speedup of searching
through an mmap()'ed file may be worth it, however, to mmap() all
files if the system is under a tunable resource limit
(max_mmaped_bytes?).

If someone is so inclined or there's enough interest, I can reverse
this test case so that data is written to an mmap()'ed file, but the
same performance difference should hold true (assuming this isn't a
write to a tape drive ::grin::).

The URL for the program used to generate the above tests is at:

http://people.freebsd.org/~seanc/mmap_test/

Please ask if you have questions. -sc

--
Sean Chittenden

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Sean Chittenden <sean(at)chittenden(dot)org>
Cc:	Neil Conway <neilc(at)samurai(dot)com>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-07 14:29:46
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

Sean Chittenden <sean(at)chittenden(dot)org> writes:
> Absolutely! Is this a simple benchmark? Yup. Do I think it
> simulates PostgreSQL? Eh, not particularly.

This would be on what OS? What hardware? What size test file?
Do the "iterations" mean so many reads of the entire file, or
so many buffer-sized read requests? Did the mmap case actually
*read* anything, or just map and unmap the file?

Also, what did you do to normalize for the effects of the test file
being already in kernel disk cache after the first test?

regards, tom lane

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Neil Conway <neilc(at)samurai(dot)com>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-07 21:46:30
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> > Absolutely! Is this a simple benchmark? Yup. Do I think it
> > simulates PostgreSQL? Eh, not particularly.

I think quite a few of these Q's would have been answered by reading
the code/Makefile....

> This would be on what OS?

FreeBSD, but it shouldn't matter. Any reasonably written VM should
have similar numbers (though BSD is generally regarded as having the
best VM, which, I think Linux poached not that long ago, iirc
::grimace::).

> What hardware?

My ultra-pathetic laptop with some fine - overly-noisy and can hardly
buildworld - IDE drives.

> What size test file?

In this case, only 72K. I've just updated the test program to use an
array of files though.

> Do the "iterations" mean so many reads of the entire file, or so
> many buffer-sized read requests?

In some cases, yes. With the file mmap()'ed, sorta. One of the test
cases (the one that did it in ~8s), mmap()'ed and munmap()'ed the file
every iteration and was twice as fast as the vanilla read() call.

> Did the mmap case actually *read* anything, or just map and unmap
> the file?

Nope, read it and wrote it out to stdout (which was redirected to
/dev/null).

> Also, what did you do to normalize for the effects of the test file
> being already in kernel disk cache after the first test?

That honestly doesn't matter too much since I wasn't testing the rate
of reading in files from my hard drive, only the OS's ability to
read/write pages of data around. In any case, I've updated my test
case to iterate through an array of files instead of just reading in a
copy of /etc/services. My laptop is generally a poor benchmark for
disk read performance given it takes 8hrs to buildworld, over 12hrs to
build mozilla, 18 for KDE, and about 48hrs for Open Office. :)
Someone with faster disks may want to try this and report back, but it
doesn't matter much in terms of relevancy for considering the benefits
of mmap(). The point is that there are calls that can be used that
substantially speed up read()'s and write()'s by allowing the VM to
align pages of data and give hints about its usage. For the sake of
argument re: the previously done tests, I'll reverse the order in
which I ran them and I bet dime to dollar that the times will be
identical.

% make ~/open_source/mmap_test
cp -f /etc/services ./services
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -DDO_MMAP_ONCE=1 -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047064672.276544
Time: 1.281477

Completed tests
1.29 real 0.10 user 0.92 sys
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047064674.266191
Time: 7.486622

Completed tests
7.49 real 0.41 user 6.01 sys
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is default read size: 65536
Number of iterations: 100000
Start time: 1047064682.288637
Time: 19.35214

Completed tests
19.04 real 0.88 user 15.43 sys
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services

Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047064701.867031
Time: 82.4294540875

Completed tests
81.57 real 2.10 user 69.55 sys

Here's the updated test that iterates through. Ooh! One better, the
files I've used are actual data files from ~pgsql. The new benchmark
iterates through the list of files, calls bench() once for each
file, and restarts at the first file after reaching the end of its
list (ARGV).
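
For anyone reading without the test source handy, the two access paths being compared look roughly like the sketch below. This is not the actual mmap-test.c: the function names `sum_with_read` and `sum_with_mmap` are made up for illustration, and the `posix_madvise()` call is only one assumed way of giving the VM a hint about usage.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sum every byte of a file by copying it through
 * a buffer with read(). Returns the sum, or -1 on error. */
long sum_with_read(const char *path, size_t bufsize)
{
    unsigned char buf[65536];
    long sum = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);
    if (bufsize == 0 || bufsize > sizeof(buf))
        bufsize = sizeof(buf);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* Each read() is a system call plus a copy into buf. */
    while ((n = read(fd, buf, bufsize)) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    close(fd);
    return n < 0 ? -1 : sum;
}

/* Hypothetical helper: sum every byte of a file through a read-only
 * mapping. Returns the sum, or -1 on error. */
long sum_with_mmap(const char *path)
{
    struct stat st;
    unsigned char *p;
    long sum = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* a zero-length mmap() fails */
        close(fd);
        return 0;
    }
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;
    /* Advisory only: tell the VM we will walk the pages in order. */
    (void) posix_madvise(p, (size_t) st.st_size, POSIX_MADV_SEQUENTIAL);
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];
    munmap(p, (size_t) st.st_size);
    return sum;
}
```

Calling `sum_with_read(path, 1)` exaggerates the per-call overhead of read(), while the mmap() path pays for page faults instead of copies; which one wins depends on file size and on whether the mapping is reused.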

Whoa, if these tests are even close to real world, then we at the very
least should be mmap()'ing the file every time we read it (assuming
we're reading more than just a handful of bytes):

find /usr/local/pgsql/data -type f | /usr/bin/xargs /usr/bin/time ./mmap-test > /dev/null
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047071143.463360
Time: 12.109530

Completed tests
12.11 real 0.36 user 6.80 sys

find /usr/local/pgsql/data -type f | /usr/bin/xargs /usr/bin/time ./mmap-test > /dev/null
Page size: 4096
File read size is default read size: 65536
Number of iterations: 100000
.... [been waiting here for >40min now....]

Ah well, if these tests finish this century, I'll post the results in
a bit, but it's pretty clearly a win. In terms of the data that I'm
copying, I'm copying ~700MB of data from my test DB on my laptop. I
only have 256MB of RAM so I can pretty much promise you that the data
isn't in my system buffers. If anyone else would like to run the
tests or look at the results, please check it out:

o1 and o2 should be the only targets used if FILES is bigger than the
RAM on the system. o3's by far and away the fastest, but only in rare
cases will a DBA have more RAM than data. But, as mentioned earlier,
the LRU cache could easily be modified to munmap() infrequently
accessed files to keep the size of mmap()'ed data down to a reasonable
level.
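
To make the munmap() idea concrete, here is a minimal sketch of a mapping cache that unmaps the least recently used file when it runs out of slots. Everything in it is an assumption for illustration (the names `map_cache_get` and `map_cache_evictions`, the fixed slot count, the path-keyed lookup); it is not taken from PostgreSQL or from the test programs.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAP_CACHE_SLOTS 2      /* hypothetical; tiny so eviction is easy to see */
#define MAP_PATH_MAX    256

struct map_slot {
    char          path[MAP_PATH_MAX];
    void         *addr;        /* NULL while the slot is free */
    size_t        len;
    unsigned long last_used;   /* logical clock value of the last access */
};

static struct map_slot slots[MAP_CACHE_SLOTS];
static unsigned long   use_clock;
static unsigned long   evictions;

/* Return a read-only mapping of path, mapping it on a miss and
 * munmap()'ing the least recently used entry if every slot is taken.
 * Returns NULL on error (including empty files). */
const void *map_cache_get(const char *path, size_t *len_out)
{
    struct map_slot *victim = &slots[0];
    struct stat st;
    void *addr;
    int fd;

    assert(path != NULL && len_out != NULL);
    if (strlen(path) >= MAP_PATH_MAX)
        return NULL;

    for (int i = 0; i < MAP_CACHE_SLOTS; i++) {
        struct map_slot *s = &slots[i];

        if (s->addr != NULL && strcmp(s->path, path) == 0) {
            s->last_used = ++use_clock;     /* cache hit */
            *len_out = s->len;
            return s->addr;
        }
        /* Prefer a free slot; otherwise remember the oldest one. */
        if (victim->addr != NULL &&
            (s->addr == NULL || s->last_used < victim->last_used))
            victim = s;
    }

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return NULL;

    if (victim->addr != NULL) {             /* evict the LRU mapping */
        munmap(victim->addr, victim->len);
        evictions++;
    }
    strcpy(victim->path, path);
    victim->addr = addr;
    victim->len = (size_t) st.st_size;
    victim->last_used = ++use_clock;
    *len_out = victim->len;
    return addr;
}

/* Number of mappings dropped so far. */
unsigned long map_cache_evictions(void)
{
    return evictions;
}
```

A real cache would bound the total mapped bytes rather than the number of files, and callers would need a way to pin a mapping while they are still using it.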

The updated test programs are at:

http://people.FreeBSD.org/~seanc/mmap_test/

-sc

--
Sean Chittenden

From:	"Marc G(dot) Fournier" <scrappy(at)hub(dot)org>
To:	Sean Chittenden <sean(at)chittenden(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 02:50:59
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

On Thu, 6 Mar 2003, Sean Chittenden wrote:

> > >>> Has anyone ever thought about adding kqueue (for *BSD) support to
> > >>> Postgres, instead of using select?
> > >>
> > >> Why? poll() is standard. kqueue isn't, AFAIK.
> >
> > > It's supposed to be a whole heap faster - there is no polling involved...
> >
> > Supposed by whom? Faster than what? And how would it not poll?
> >
> > The way libpq uses this call, it's either probing for current status
> > (timeout=0) or it's willing to block, possibly indefinitely, until the
> > desired condition arises. It does not sit there in a busy-wait loop.
> > I can't see any reason to think that an OS-specific API would give
> > any marked difference in performance.
>
> Heh, kqueue is _the_ reason to use FreeBSD.
>
> http://www.kegel.com/dkftpbench/Poller_bench.html#results
>
> I've toyed with the idea of adding this because it is monstrously more
> efficient than select()/poll() in basically every way, shape, and
> form.

I would personally be interested in seeing patches ... what would be
involved?

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	"Marc G(dot) Fournier" <scrappy(at)hub(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 04:11:33
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> > Heh, kqueue is _the_ reason to use FreeBSD.
> >
> > http://www.kegel.com/dkftpbench/Poller_bench.html#results
> >
> > I've toyed with the idea of adding this because it is monstrously more
> > efficient than select()/poll() in basically every way, shape, and
> > form.
>
> I would personally be interested in seeing patches ... what would be
> involved?

Whoa! Surprisingly, much less than I expected! A small shim would
have to be put in place to abstract away how file descriptors that
are ready for read()/write() get reported. What's really cool is
that there are only a handful of places that'd have to be updated (as
far as I can tell):

src/backend/access/transam/xact.c
src/backend/postmaster/pgstat.c
src/backend/postmaster/postmaster.c
src/backend/storage/lmgr/s_lock.c
src/bin/pg_dump/pg_dump.c
src/interfaces/libpq/fe-misc.c

Then it'd be possible to have clients/servers switch between kqueue,
poll, select, or whatever the new flavor of notification for ready IO
fd's turns out to be. I've added it to my personal TODO list of things
to work on. If someone beats me to it, cool, it's just something that
one day I'll get to (hopefully). -sc
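
A rough sketch of what such a shim could look like is below. It is an illustration under assumptions, not a patch: the name `wait_readable`, its return convention, and the compile-time platform test are all made up here, and nothing in it comes from the PostgreSQL source.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__) || \
    defined(__APPLE__)
#define USE_KQUEUE 1
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#else
#include <poll.h>
#endif

/* Hypothetical shim: wait until fd is readable.
 * timeout_ms == 0 probes without blocking, timeout_ms < 0 blocks
 * indefinitely. Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    assert(fd >= 0);
#ifdef USE_KQUEUE
    struct kevent change, event;
    struct timespec ts, *tsp = NULL;
    int kq, n;

    kq = kqueue();
    if (kq < 0)
        return -1;
    /* Register interest in "fd is readable", firing once. */
    EV_SET(&change, fd, EVFILT_READ, EV_ADD | EV_ONESHOT, 0, 0, NULL);
    if (timeout_ms >= 0) {
        ts.tv_sec = timeout_ms / 1000;
        ts.tv_nsec = (long) (timeout_ms % 1000) * 1000000L;
        tsp = &ts;
    }
    n = kevent(kq, &change, 1, &event, 1, tsp);
    close(kq);
    if (n < 0 || (n > 0 && (event.flags & EV_ERROR)))
        return -1;
    return n > 0 ? 1 : 0;
#else
    struct pollfd pfd;
    int n;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    return n > 0 ? 1 : 0;
#endif
}
```

Creating and closing a kqueue on every call, as this sketch does, throws away most of what kqueue offers; a serious version would keep the kqueue open and register each descriptor once. With a single descriptor per wait, as in libpq, the difference from poll() may well be too small to measure.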

--
Sean Chittenden

From:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To:	"Sean Chittenden" <sean(at)chittenden(dot)org>, "Marc G(dot) Fournier" <scrappy(at)hub(dot)org>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-committers(at)postgresql(dot)org>
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 04:17:46
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> > I would personally be interested in seeing patches ... what would be
> > involved?
>
> Whoa! Surprisingly, much less than I expected!!! A small shim would
> have to be put in place to abstract away returning valid file
> descriptors that are ready to be read()/write(). What's really cool,
> is there are only a handful of places that'd have to be updated (as
> far as I can tell):

It would be nice to have this support there, however Tom was correct in
saying it really only applies to network apps that are handling thousands of
connections, all really, really fast. Postgres doesn't. I say you'd have
to do the work, then do the benchmarking to see if it makes a difference.

Chris

From:	Neil Conway <neilc(at)samurai(dot)com>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	Sean Chittenden <sean(at)chittenden(dot)org>, "Marc G(dot) Fournier" <scrappy(at)hub(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 04:42:35
Message-ID:	1047357755.357.1.camel@tokyo
Lists:	pgsql-committers pgsql-performance

On Mon, 2003-03-10 at 23:17, Christopher Kings-Lynne wrote:
> It would be nice to have this support there, however Tom was correct in
> saying it really only applies to network apps that are handling thousands of
> connections, all really, really fast. Postgres doesn't. I say you'd have
> to do the work, then do the benchmarking to see if it makes a difference.

... and if it doesn't make a significant difference, I'd oppose
including it in the mainline source. Performance optimization is one
thing; performance "optimization" that doesn't actually improve
performance is another :-)

Cheers,

Neil
--
Neil Conway <neilc(at)samurai(dot)com> || PGP Key ID: DB3C29FC

From:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To:	"Neil Conway" <neilc(at)samurai(dot)com>
Cc:	"Sean Chittenden" <sean(at)chittenden(dot)org>, "Marc G(dot) Fournier" <scrappy(at)hub(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-committers(at)postgresql(dot)org>
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 04:53:01
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> > It would be nice to have this support there, however Tom was correct in
> > saying it really only applies to network apps that are handling
> > thousands of connections, all really, really fast. Postgres doesn't.
> > I say you'd have to do the work, then do the benchmarking to see if
> > it makes a difference.
>
> ... and if it doesn't make a significant difference, I'd oppose
> including it in the mainline source. Performance optimization is one
> thing; performance "optimization" that doesn't actually improve
> performance is another :-)

That was the unsaid implication... :)

Chris

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Neil Conway <neilc(at)samurai(dot)com>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, "Marc G(dot) Fournier" <scrappy(at)hub(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 04:56:10
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> > It would be nice to have this support there, however Tom was
> > correct in saying it really only applies to network apps that are
> > handling thousands of connections, all really, really fast.
> > Postgres doesn't. I say you'd have to do the work, then do the
> > benchmarking to see if it makes a difference.
>
> ... and if it doesn't make a significant difference, I'd oppose
> including it in the mainline source. Performance optimization is one
> thing; performance "optimization" that doesn't actually improve
> performance is another :-)

::sigh:: Well, I'm not about to argue one way or another on this
beyond saying: kqueue is better than select/poll, but there are much
bigger, much lower, and much easier pieces of fruit to pick off the
optimization tree given the cost/benefit for the amount of network IO
PostgreSQL does. That said, what was the performance gain of moving
from select() to poll()? It wasn't the biggest optimization in
PostgreSQL history, nor the smallest, but it was a step forward. -sc

--
Sean Chittenden

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Sean Chittenden <sean(at)chittenden(dot)org>
Cc:	Neil Conway <neilc(at)samurai(dot)com>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, "Marc G(dot) Fournier" <scrappy(at)hub(dot)org>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 05:06:14
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

Sean Chittenden <sean(at)chittenden(dot)org> writes:
> That said, what was the performance gain of moving
> from select() to poll()? It wasn't the biggest optimization in
> PostgreSQL history, nor the smallest, but it was a step forward. -sc

That change was not sold as a performance improvement; I doubt that it
is one. It was sold as not failing when libpq runs inside an
application that has thousands of open files (i.e., more than select()
can cope with). "Faster" is debatable, "fails" is not...
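
The failure mode can be sketched as follows. This is an illustration only, not the fe-misc.c change: the helper names and the -2 return code are invented here. The point is that an fd_set has room for descriptors 0 through FD_SETSIZE - 1 and nothing else, while struct pollfd carries the descriptor as a plain int.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* An fd_set can only describe descriptors below FD_SETSIZE. */
int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Non-blocking probe using select().
 * Returns 1 if readable, 0 if not, -1 on error, and -2 when the
 * descriptor cannot be represented in an fd_set at all. */
int probe_with_select(int fd)
{
    fd_set rfds;
    struct timeval tv;
    int n;

    assert(fd >= 0);
    if (!fits_in_fd_set(fd))
        return -2;              /* FD_SET() here would write out of bounds */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = 0;
    tv.tv_usec = 0;
    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return n > 0 ? 1 : 0;
}

/* Non-blocking probe using poll(); works for any valid descriptor.
 * Returns 1 if readable, 0 if not, -1 on error. */
int probe_with_poll(int fd)
{
    struct pollfd pfd;
    int n;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    n = poll(&pfd, 1, 0);
    if (n < 0)
        return -1;
    return n > 0 ? 1 : 0;
}
```

In an application that already holds more than FD_SETSIZE files open, the socket libpq gets back from the kernel lands in exactly the range where `probe_with_select` has to give up.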

regards, tom lane

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Neil Conway <neilc(at)samurai(dot)com>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, "Marc G(dot) Fournier" <scrappy(at)hub(dot)org>, pgsql-committers(at)postgresql(dot)org
Subject:	Re: pgsql-server/ /configure /configure.in rc/incl ...
Date:	2003-03-11 05:30:33
Message-ID:	[email protected]
Lists:	pgsql-committers pgsql-performance

> > That said, what was the performance gain of moving from select()
> > to poll()? It wasn't the biggest optimization in PostgreSQL
> > history, nor the smallest, but it was a step forward. -sc
>
> That change was not sold as a performance improvement; I doubt that
> it is one. It was sold as not failing when libpq runs inside an
> application that has thousands of open files (i.e., more than
> select() can cope with). "Faster" is debatable, "fails" is not...

Well, I've only heard through 2nd hand sources (dillion) the kind of
hellish conditions that Marc has on his boxen, but "faster and more
efficient in the kernel" is "faster and more efficient in the kernel"
no matter how ya slice it, and I know that every last bit helps a
loaded system.

I'm not stating that most people, or even 90% of people, will notice.
Hopefully 100% of the universe runs their boxen under ideal conditions
(like most databases should, right? ::wink wink, nudge nudge::). For
those that don't, however, and get to watch things run in the red with
a load average over 20, the use of kqueue or more efficient system
calls is likely very much appreciated. -sc

--
Sean Chittenden