Lists: pgsql-hackers
From: nconway(at)klamath(dot)dyndns(dot)org (Neil Conway)
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: test data for query optimizer
Date: 2002-07-12 16:05:41
Message-ID: [email protected]

I'd like to look at the performance of the query optimizer (both the
traditional one and GEQO) when joining large numbers of tables: 10-15,
or more. In order to do that (and to get meaningful results), I'll
need to work with some data that actually requires joins of that
magnitude. Ideally, I'd like the data to be somewhat realistic -- so
that the performance I'm seeing will reflect the performance a typical
user might see. (i.e. I don't want an artificial benchmark)

However, I don't possess any data of that nature, and I'm unsure
where I can find some (or how to generate some of my own). Does
anyone know of:

- a freely available collection of data that requires queries
of this type, and is reasonably representative of "real world"
applications

- or, a means to programmatically generate some data that
fits the above criteria.

Thanks in advance,

Neil

--
Neil Conway <neilconway(at)rogers(dot)com>
PGP Key ID: DB3C29FC


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: nconway(at)klamath(dot)dyndns(dot)org (Neil Conway)
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: test data for query optimizer
Date: 2002-07-12 17:14:56
Message-ID: [email protected]

nconway(at)klamath(dot)dyndns(dot)org (Neil Conway) writes:
> I'd like to look at the performance of the query optimizer (both the
> traditional one and GEQO) when joining large numbers of tables: 10-15,
> or more. In order to do that (and to get meaningful results), I'll
> need to work with some data that actually requires joins of that
> magnitude.

The easiest way to construct a realistic many-way join is to use a star
schema. Here you have a primary "fact table" that includes a lot of
columns that individually join to the primary keys of other "detail
tables". For example, you might have a column "State" in the fact table
with values like "PA", "NY", etc, and you want to join it to a table
states(abbrev,fullname,...) so your query can display "Pennsylvania",
"New York", etc. It's easy to make up realistic examples that involve
any number of joins.
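A minimal sketch of that pattern, using SQLite via Python so it is self-contained; apart from states(abbrev,fullname) from the example above, the table and column names are made up for illustration:

```python
import sqlite3

# Build a tiny star schema: one "fact table" (sales) whose columns each
# join to the primary key of a separate detail (dimension) table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE states (abbrev TEXT PRIMARY KEY, fullname TEXT)")
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE channels (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE sales (  -- the fact table
    state TEXT, product INTEGER, channel INTEGER, amount REAL)""")

cur.executemany("INSERT INTO states VALUES (?, ?)",
                [("PA", "Pennsylvania"), ("NY", "New York")])
cur.executemany("INSERT INTO products VALUES (?, ?)",
                [(1, "widget"), (2, "gadget")])
cur.executemany("INSERT INTO channels VALUES (?, ?)",
                [(1, "retail"), (2, "web")])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [("PA", 1, 2, 10.0), ("NY", 2, 1, 20.0)])

# Each dimension column in the fact table adds one more join; scaling
# this pattern up to 10-15 dimension tables gives a realistic many-way
# join for the planner to chew on.
rows = cur.execute("""
    SELECT st.fullname, p.name, ch.name, s.amount
    FROM sales s
    JOIN states st   ON s.state = st.abbrev
    JOIN products p  ON s.product = p.id
    JOIN channels ch ON s.channel = ch.id
    ORDER BY s.amount
""").fetchall()
print(rows)
# → [('Pennsylvania', 'widget', 'web', 10.0), ('New York', 'gadget', 'retail', 20.0)]
```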

This is of course only one usage pattern for lots-o-joins, so don't put
too much credence in it alone as a benchmark, but it's certainly a
widely used pattern.

Searching for "star schema" at Google turned up some interesting things
last time I tried it.

regards, tom lane


From: "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To: "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Neil Conway" <nconway(at)klamath(dot)dyndns(dot)org>
Subject: Re: test data for query optimizer
Date: 2002-07-13 03:18:14
Message-ID: 012e01c22a1b$ed5cf110$0200a8c0@SOL

What about the OSDB benchmark? Does that contain a large dataset?

Chris

----- Original Message -----
From: "Neil Conway" <nconway(at)klamath(dot)dyndns(dot)org>
To: "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Sent: Saturday, July 13, 2002 12:05 AM
Subject: [HACKERS] test data for query optimizer

> I'd like to look at the performance of the query optimizer (both the
> traditional one and GEQO) when joining large numbers of tables: 10-15,
> or more. In order to do that (and to get meaningful results), I'll
> need to work with some data that actually requires joins of that
> magnitude. Ideally, I'd like the data to be somewhat realistic -- so
> that the performance I'm seeing will reflect the performance a typical
> user might see. (i.e. I don't want an artificial benchmark)
>
> However, I don't possess any data of that nature, and I'm unsure
> where I can find some (or how to generate some of my own). Does
> anyone know of:
>
> - a freely available collection of data that requires queries
> of this type, and is reasonably representative of "real world"
> applications
>
> - or, a means to programmatically generate some data that
> fits the above criteria.
>
> Thanks in advance,
>
> Neil
>
> --
> Neil Conway <neilconway(at)rogers(dot)com>
> PGP Key ID: DB3C29FC
>


From: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
To: Neil Conway <nconway(at)klamath(dot)dyndns(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: test data for query optimizer
Date: 2002-07-13 04:00:40
Message-ID: [email protected]

What about the TPC-H benchmark?

I can't recall if it has more than 10 tables, but it seemed like the
queries were "quite good" for a benchmark. In addition, it comes with a
data generator.
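For reference, TPC-H's generator (dbgen) emits pipe-delimited .tbl files, one per table, with a trailing delimiter on every line. A small illustrative sketch of parsing such a line into fields for loading (the sample values are made up in the region.tbl shape):

```python
# dbgen writes files such as region.tbl, pipe-delimited with a
# trailing '|' on each line.  Parse one line into its fields, suitable
# for feeding to COPY or a batch of INSERTs.
def parse_tbl_line(line):
    # drop the newline and the trailing delimiter, then split on '|'
    return line.rstrip("\n").rstrip("|").split("|")

# sample line in the region.tbl layout (values illustrative)
sample = "0|AFRICA|lar deposits. blithely final packages|\n"
fields = parse_tbl_line(sample)
print(fields)
# → ['0', 'AFRICA', 'lar deposits. blithely final packages']
```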

regards

Mark
On Sat, 2002-07-13 at 04:05, Neil Conway wrote:
> I'd like to look at the performance of the query optimizer (both the
> traditional one and GEQO) when joining large numbers of tables: 10-15,
> or more. In order to do that (and to get meaningful results), I'll
> need to work with some data that actually requires joins of that
> magnitude. Ideally, I'd like the data to be somewhat realistic -- so
> that the performance I'm seeing will reflect the performance a typical
> user might see. (i.e. I don't want an artificial benchmark)
>
>


From: nconway(at)klamath(dot)dyndns(dot)org (Neil Conway)
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: test data for query optimizer
Date: 2002-07-13 04:06:47
Message-ID: [email protected]

On Sat, Jul 13, 2002 at 11:18:14AM +0800, Christopher Kings-Lynne wrote:
> What about the OSDB benchmark? Does that contain a large dataset?

No -- it only uses 5 relations total, with the most complex query
only involving 4 joins.

Cheers,

Neil

--
Neil Conway <neilconway(at)rogers(dot)com>
PGP Key ID: DB3C29FC