Chapter 2 and 3: basic Data handling koop

CHAPTER 2
Basic data handling
This chapter introduces the basics of economic data handling. It focusses on four
important areas: (1) the types of data that economists often use; (2) a brief discus-
sion of the sources from which economists obtain data;1
(3) an illustration of the
types of graphs that are commonly used to present information in a data set; and (4)
a discussion of simple numerical measures, or descriptive statistics, often presented
to summarize key aspects of a data set.
Types of economic data
This section introduces common types of data and defines the terminology associated
with their use.
Time series data
Macroeconomic data measures phenomena such as real gross domestic product
(denoted GDP), interest rates, the money supply, etc. This data is collected at specific
points in time (e.g. yearly). Financial data, on the other hand, measures phenomena
such as changes in the price of stocks. This type of data is collected more frequently
than the above, for instance, daily or even hourly. In all of these examples, the
data are ordered by time and are referred to as time series data. The underlying
phenomenon which we are measuring (e.g. GDP or wages or interest rates, etc.) is
referred to as a variable. Time series data can be observed at many frequencies.
Commonly used frequencies are: annual (i.e. a variable is observed every year),
quarterly (i.e. four times a year), monthly, weekly or daily.

In this book, we will use the notation Yt to indicate an observation on variable Y
(e.g. real GDP) at time t. A series of data runs from period t = 1 to t = T. “T ” is used
to indicate the total number of time periods covered in a data set. To give an example,
if we were to use post-war annual real GDP data from 1946–1998 – a period of 53
years – then t = 1 would indicate 1946, t = 53 would indicate 1998 and T = 53 the
total number of years. Hence, Y1 would be real GDP in 1946, Y2 real GDP for 1947,
etc. Time series data is typically presented in chronological order.
Working with time series data often requires some special tools, which are discussed
in Chapters 8–11.
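The mapping between calendar years and the time index t can be sketched in a few lines of Python. This is only an illustration of the notation for the 1946–1998 example above; the helper name `time_index` is our own and does not come from the book or its data files.

```python
# Map calendar years to the time index t used in the text (t = 1, ..., T).
first_year, last_year = 1946, 1998

years = list(range(first_year, last_year + 1))
T = len(years)  # total number of time periods in the data set


def time_index(year):
    """Return t for a given calendar year, with t = 1 for the first year."""
    return year - first_year + 1


print(T)                 # 53
print(time_index(1946))  # 1
print(time_index(1998))  # 53
```

Note that both end points are counted, which is why 1946–1998 covers 53 years rather than 52.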
Cross-sectional data
In contrast to the above, micro- and labor economists often work with data that is
characterized by individual units. These units might refer to people, companies or
countries. A common example is data pertaining to many different people within a
group, such as the wage of all people in a certain company or industry. With such
cross-sectional data, the ordering of the data typically does not matter (unlike time
series data).
In this book, we use the notation Yi to indicate an observation on variable Y for
individual i. Observations in a cross-sectional data set run from individual i = 1 to N.
By convention, N indicates the number of cross-sectional units (e.g. the number of
people surveyed). For instance, a labor economist might wish to survey N = 1,000
workers in the steel industry, asking each individual questions such as how much they
make or whether they belong to a union. In this case, Y1 will be equal to the wage
(or union membership) reported by the first worker, Y2 the wage (or union mem-
bership) reported by the second worker, and so on.
Similarly, a microeconomist may ask N = 100 representatives from manufacturing
companies about their profit figures in the last month. In this case, Y1 will equal the
profit reported by the first company, Y2 the profit reported by the second company,
through to Y100, the profit reported by the 100th company.
The distinction between qualitative and quantitative data
The previous data sets can be used to illustrate an important distinction between types
of data. The microeconomist’s data on profits will have a number corresponding to
each firm surveyed (e.g. last month’s profits in the first company surveyed were
£20,000). This is referred to as quantitative data.
The labor economist, when asking whether or not each surveyed employee belongs
to a union, receives either a Yes or a No answer. These answers are referred to as
qualitative data. Such data arise often in economics when choices are involved
(e.g. the choice to buy or not buy a product, to take public transport or a private car,
to join or not to join a club).
Economists will usually convert these qualitative answers into numeric data. For
instance, the labor economist might set Yes = 1 and No = 0. Hence, Y1 = 1 means
that the first individual surveyed does belong to a union, Y2 = 0 means that the second
individual does not. When variables can take on only the values 0 or 1, they are
referred to as dummy (or binary) variables. Working with such variables is a topic
that will be discussed in detail in Chapter 7.
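As a minimal sketch of this coding step, the Python fragment below converts Yes/No answers into a dummy variable. The five survey answers are hypothetical and are used only to show the mechanics.

```python
# Hypothetical answers to the question "Do you belong to a union?"
answers = ["Yes", "No", "No", "Yes", "Yes"]

# Code Yes as 1 and No as 0, as described in the text.
coding = {"Yes": 1, "No": 0}
union = [coding[a] for a in answers]

print(union)  # [1, 0, 0, 1, 1]

# Because the variable only takes the values 0 and 1, its average is
# simply the proportion of respondents who answered Yes.
share_yes = sum(union) / len(union)
print(share_yes)  # 0.6
```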
Panel data
Some data sets will have both a time series and a cross-sectional component. This
data is referred to as panel data. Economists working on issues related to
growth often make use of panel data. For instance, GDP for many countries from
1950 to the present is available. A panel data set on Y = GDP for 12 European coun-
tries would contain the GDP value for each country in 1950 (N = 12 observations),
followed by the GDP for each country in 1951 (another N = 12 observations),
and so on. Over a period of T years, there would be T × N observations on Y.
Alternatively, labor economists often work with large panel data sets created by
asking many individuals questions such as how much they make every year for several
years.
We will use the notation Yit to indicate an observation on variable Y for unit i at
time t. In the economic growth example, Y11 will be GDP in country 1, year 1, Y12
GDP for country 1 in year 2, etc. In the labor economics example, Y11 will be the
wage of the first individual surveyed in the first year, Y12 the wage of the first indi-
vidual surveyed in the second year, etc.
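The double-index notation can be made concrete with a small Python sketch. The numbers below are hypothetical (three units observed for two years), and storing the panel as a dictionary keyed by (i, t) is just one convenient choice for illustration.

```python
# Hypothetical panel: N = 3 countries observed over T = 2 years.
# Keys are (i, t) pairs, so panel[(1, 2)] plays the role of Y12 in the text.
panel = {
    (1, 1): 100.0, (1, 2): 103.0,
    (2, 1): 250.0, (2, 2): 255.0,
    (3, 1): 80.0,  (3, 2): 84.0,
}

N = len({i for (i, t) in panel})  # number of cross-sectional units
T = len({t for (i, t) in panel})  # number of time periods

print(N, T, len(panel))  # 3 2 6

# Fixing t gives a cross-section; fixing i gives a time series.
cross_section_year_1 = [panel[(i, 1)] for i in range(1, N + 1)]
time_series_country_1 = [panel[(1, t)] for t in range(1, T + 1)]
print(cross_section_year_1)   # [100.0, 250.0, 80.0]
print(time_series_country_1)  # [100.0, 103.0]
```

The total number of observations equals T times N, matching the count given in the text.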
Data transformations: levels versus growth rates
In this book, we will mainly assume that the data of interest, Y, is directly available.
However, in practice, you may be required to take raw data from one source, and then
transform it into a different form for your empirical analysis. For instance, you may
take raw time series data on the variables W = total consumption expenditure, and
X = expenditure on food, and create a new variable: Y = the proportion of expendi-
ture devoted to food. Here the transformation would be Y = X/W. The exact nature
of the transformations required depends on the problem at hand, so it is hard to offer
any general recommendations on data transformation. Some special cases are con-
sidered in later chapters. Here it is useful to introduce one common transformation
that econometricians use with time series data.
To motivate this transformation, suppose we have annual data on real GDP for
1950–1998 (i.e. 49 years of data) denoted by Yt , for t = 1 to 49. In many empirical
projects, this might be the variable of primary interest. We will refer to such series as
the level of real GDP. However, people are often more interested in how the
economy is growing over time, or in real GDP growth. A simple way to measure
growth is to take the real GDP series and calculate a percentage change for each year.
The percentage change in real GDP between period t and t + 1 is calculated accord-
ing to the formula:2
The percentage change in real GDP is often referred to as the growth of GDP or
the change in GDP. Time series data will be discussed in more detail in Chapters
8–11. It is sufficient for you to note here that we will occasionally distinguish between
the level of a variable and its growth rate, and that it is common to transform levels
data into growth rate data.
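To illustrate the transformation from levels to growth rates, the sketch below computes the percentage change between period t and t + 1 as 100 × (Y at t + 1 minus Y at t) divided by Y at t. The GDP figures are hypothetical and the function name `percentage_changes` is our own.

```python
def percentage_changes(levels):
    """Percentage change between each period t and t + 1:
    100 * (Y[t+1] - Y[t]) / Y[t]."""
    return [100.0 * (nxt - cur) / cur for cur, nxt in zip(levels, levels[1:])]


# Hypothetical real GDP levels for four consecutive years.
real_gdp = [200.0, 210.0, 205.8, 216.09]

growth = percentage_changes(real_gdp)
print([round(g, 2) for g in growth])  # [5.0, -2.0, 5.0]
print(len(real_gdp), len(growth))     # 4 3
```

A series of T levels yields only T − 1 growth rates, since no growth rate can be calculated for the first period.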
Index numbers
Many variables that economists work with come in the form of index numbers.
Appendix 2.1 at the end of this chapter provides a detailed discussion of what these
are and how they are calculated. However, if you just want to use an index number
in your empirical work, a precise knowledge of how to calculate indices is probably
unnecessary. Having a good intuitive understanding of how an index number is
interpreted is sufficient. Accordingly, here in the body of the text I provide only an
informal intuitive discussion of index numbers.
Suppose you are interested in studying a country’s inflation rate, which is a measure
of how prices change over time. The question arises as to how we measure “prices”
in a country. The price of an individual good (e.g. milk, oranges, electricity, a par-
ticular model of car, a pair of shoes, etc.) can be readily measured, but often interest
centers not on individual goods, but on the price level of the country as a whole.
The latter concept is usually defined as the price of a “basket” containing the sorts
of goods that a typical consumer might buy. The price of this basket is observed at
regular intervals over time in order to determine how prices are changing in the
country as a whole. But the price of the basket is usually not directly reported by the
government agency that collects such data. After all, if you are told the price of an
individual good (e.g. that an orange costs 35 pence), you have been told something
informative, but if you are told “the price of a basket of representative goods” is
£10.45, that statement is not very informative. To interpret this latter number, you
would have to know what precisely was in the basket and in what quantities. Given
the millions of goods bought and sold in a modern economy, far too much infor-
mation would have to be given.
In light of such issues, data often comes in the form of a price index. Indices may
be calculated in many ways, and it would distract from the main focus of this chapter
to talk in detail about how they are constructed (see Appendix 2.1 for more details).
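Formula for the percentage change in real GDP between period t and t + 1 (see the discussion of levels versus growth rates):

\[ \%\ \text{change} = \frac{Y_{t+1} - Y_t}{Y_t} \times 100 \]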

However, the following points are worth noting at the outset. Firstly, indices almost
invariably come as time series data. Secondly, one time period is usually chosen as a
base year and the price level in the base year is set to 100 (some indices set the base
year value to 1.00 instead of 100). Thirdly, price levels in other years are measured in
percentages relative to the base year.
An example will serve to clarify these issues. Suppose a price index for four years
exists, and the values are: Y1 = 100, Y2 = 106, Y3 = 109 and Y4 = 111. These numbers
can be interpreted as follows. The first year has been selected as a base year and,
accordingly, Y1 = 100. The figures for other years are all relative to this base year and
allow for a simple calculation of how prices have changed since the base year. For
instance, Y2 = 106 means that prices have risen from 100 to 106 – a 6% rise since the
first year. It can also be seen that prices have risen by 9% from year 1 to year 3 and
by 11% from year 1 to year 4. Since the percentage change in prices is the definition
of inflation, the price index allows the person looking at the data to easily see what
inflation is. In other words, you can think of a price index as a way of presenting
price data that is easy to interpret and understand.
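The arithmetic behind this example can be checked with a short Python sketch. It uses the four index values given above; the distinction drawn in the code between the cumulative rise since the base year and year-on-year inflation is our own way of organizing the calculation.

```python
index = [100, 106, 109, 111]  # Y1..Y4 from the example; year 1 is the base year

# Cumulative price rise since the base year, in percent.
since_base = [100.0 * (y - index[0]) / index[0] for y in index]
print(since_base)  # [0.0, 6.0, 9.0, 11.0]

# Year-on-year inflation: percentage change between consecutive years.
inflation = [100.0 * (b - a) / a for a, b in zip(index, index[1:])]
print([round(x, 2) for x in inflation])  # [6.0, 2.83, 1.83]
```

Only the change relative to the base year can be read directly off the index; inflation between two later years (e.g. year 2 to year 3) still requires the percentage change calculation.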
A price index is very good for measuring changes in prices over time, but should
not be used to talk about the level of prices. For instance, it should not be inter-
preted as an indicator of whether prices are “high” or “low”. A simple example
illustrates why this is the case.
The US and Canada both collect data on consumer prices. Suppose that both coun-
tries decide to use 1988 as the base year for their respective price indices. This means
that the price index in 1988 for both countries will be 100. It does not mean that prices
were identical in both countries in 1988. The choice of 1988 as a base year is arbitrary; if
Canada were to suddenly change its choice of base year to 1987 then the indices in
1988 would no longer be the same for both countries. Price indices for the two coun-
tries cannot be used to make statements such as: “Prices are higher in Canada than
the US.” But they can be used to calculate inflation rates. This allows us to make
statements of the type: “Inflation (i.e. price changes) is higher in Canada than the
US.”
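A small Python sketch shows why the choice of base year is harmless for measuring inflation. It reuses the four-year index from the earlier example and rescales it so that a different year equals 100; the function names `rebase` and `inflation` are hypothetical helpers written for this illustration.

```python
def rebase(index, base_position):
    """Rescale an index so that the value at base_position equals 100."""
    base_value = index[base_position]
    return [100.0 * y / base_value for y in index]


def inflation(index):
    """Percentage change between consecutive periods."""
    return [100.0 * (b - a) / a for a, b in zip(index, index[1:])]


original = [100, 106, 109, 111]  # base year is the first year
rebased = rebase(original, 1)    # make the second year the base year instead

print([round(y, 1) for y in rebased])              # [94.3, 100.0, 102.8, 104.7]
print([round(x, 2) for x in inflation(original)])  # [6.0, 2.83, 1.83]
print([round(x, 2) for x in inflation(rebased)])   # [6.0, 2.83, 1.83]
```

Changing the base year alters every value of the index but leaves the implied inflation rates unchanged, which is why indices support statements about price changes but not about price levels.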
Finance is another field where price indices often arise since information on stock
prices is often presented in this form. That is, commonly reported measures of stock
market activity such as the Dow Jones Industrial Average, the FTSE index and the
S&P 500 are all price indices.
In our discussion, we have focussed on price indices, and these are indeed by far
the most common type of index numbers. Note that other types of indices (e.g. quan-
tity indices) exist and should be interpreted in a similar manner to price indices. That
is, they should be used as a basis for measuring how phenomena have changed from
a given base year.
This discussion of index numbers is a good place to mention another transfor-
mation which is used to deal with the effects of inflation. As an example, consider
the most common measure of the output of an economy: gross domestic product

(GDP). GDP can be calculated by adding up the value of all goods produced in the
economy. However, in times of high inflation, simply looking at how GDP is chang-
ing over time can be misleading. If inflation is high, the prices of goods will be rising
and thus their value will be rising over time, even if the actual amount of goods
produced is not increasing. Since GDP measures the value of all goods, it will be
rising in high inflation times even if production is stagnant. This leads researchers to
want to correct for the effect of inflation. The way to do this is to divide the GDP
measure by a price index (in the case of GDP, the name given to the price index is
the GDP deflator). GDP transformed in this way is called real GDP. The original
GDP variable is referred to as nominal GDP. This distinction between real and
nominal variables is important in many fields of economics. The key things you
should remember are that a real variable is a nominal variable divided by a price vari-
able (usually a price index) and that real variables have the effects of inflation removed
from them.
The case where you wish to correct a growth rate for inflation is slightly different.
In this case, creating the real variable involves subtracting the change in the
price index from the nominal variable. So, for instance, real interest rates are nominal
interest rates minus inflation (where inflation is defined as the change in the price
index).
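The two corrections for inflation can be sketched as follows. All figures are hypothetical, and we assume the deflator is reported with a base-year value of 100, so it is divided by 100 before use.

```python
# Hypothetical nominal GDP and GDP deflator (base year = 100) for three years.
nominal_gdp = [500.0, 550.0, 605.0]
deflator = [100.0, 110.0, 121.0]

# Levels: real value = nominal value divided by the price index
# (rescaled so that the base year equals 1).
real_gdp = [n / (p / 100.0) for n, p in zip(nominal_gdp, deflator)]
print([round(y, 2) for y in real_gdp])  # [500.0, 500.0, 500.0]

# Rates: subtract inflation instead of dividing.
nominal_interest_rate = 7.0  # percent per year (hypothetical)
inflation_rate = 3.0         # percent per year (hypothetical)
real_interest_rate = nominal_interest_rate - inflation_rate
print(real_interest_rate)  # 4.0
```

In this example nominal GDP rises by 10% a year, but so do prices, so real GDP is flat: all of the apparent growth is due to inflation.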
Obtaining data
All of the data you need in order to understand the basic concepts and to carry out
the simple analyses covered in this book can be downloaded from the website asso-
ciated with this book. However, in the future you may need to gather your own data
for an essay, dissertation or report. Economic data come from many different sources
and it is hard to offer general comments on the collection of data. Below are a
few key points that you should note about common data sets and where to find
them.
Most macroeconomic data is collected through a system of national accounts,
made available in printed and, increasingly, digital form in university and government
libraries. Microeconomic data is usually collected by surveys of households, employ-
ment and businesses, which are often available from the same sources.
It is becoming increasingly common for economists to obtain their data over the
Internet, and many relevant World Wide Web (www) sites now exist from which data
can be downloaded. You should be forewarned that the Internet is a rapidly growing
and changing place, so that the information and addresses provided here might soon
be outdated. Accordingly, this section is provided only to give an indication of what
can be obtained over the Internet, and as such is far from complete. For a more
detailed description of what is available on the Internet and how to access it, you may
wish to consult Computing Skills for Economists by Guy Judge.

Before you begin searching, you should also note that some sites allow users to
access data for free while others charge for their data sets. Many will provide free data
to non-commercial (e.g. university) users, but require that you register before
being allowed access to the data.
An extremely useful American site is “Resources for Economists on the Internet”
(http://rfe.wustl.edu/EconFAQ.html). This site contains all sorts of interesting
material on a wide range of economic topics. You should take the time to explore it.
This site also provides links to many different data sources. Another site with useful
links is the National Bureau of Economic Research (http://www.nber.org/). One
good data source available through this site is the Penn World Table (PWT), which
gives macroeconomic data for over 100 countries for many years. We will refer to the
PWT later in the chapter.
In the United Kingdom, MIMAS (Manchester Information & Associated Services)
is a useful gateway to many data sets (http://www.mimas.ac.uk). This site currently
requires a registration process.
It is worth noting that data on the Internet is often simply listed on the screen.
You can, of course, always copy the data down by hand and then type it into Excel.
But it is far less time-consuming either to save the data in a file (using File/Save as)
or to highlight the data, copy it to a clipboard, and then paste it into Excel.
To give you a flavor for the kinds of data sets available from the Internet, and what
Internet sites look like, we will focus on a common US website and a common UK
site.
Example: Resources for Economists on the Internet
If you follow the link labeled “Data” on the “Resources for Economists on the
Internet” website, the following page appears on the screen.
Data
US Macro and Regional Data (data for the US economy and its regions)
Other US Data (other types of US data)
World and Non-US Data (data from around the world)
Finance and Financial Markets (data from financial markets)
Journal Data and Program Archives (academic journal archives)
If you click on any of the links (indicated here in italics), you obtain a listing of
numerous additional Internet links you can connect to containing various types
of data.

Example: MIMAS
The following material was taken directly from the MIMAS website. Of course,
you will not understand what all the titles below mean. But a few key abbreviations
are: ONS = Office for National Statistics (the main UK government data
source), IMF = International Monetary Fund (which collects data from many
countries including developing countries), and OECD = Organisation for Eco-
nomic Co-operation and Development (which collects data for industrialized
countries).
Census and related data sets
Census information gateway
1991 Local Base and Small Area Statistics – download a registration pack.
Special workplace and migration statistics
1991 Samples of Anonymized Records
1981 Small Area Statistics
1991 Census Digitized Boundary Data
1981 Digitized Boundary Data
Table 100
Census Monitor county/district tables
Topic Statistics
Population Surface Models
Estimating with Confidence data
ONS ward and district level classifications
GB Profiler
The Longitudinal Study
1971/81 Change File
Postcode to ED/OA Directories
Central Postcode Directory POSTZON File
ONS Vital Statistics for Wards
Government and other continuous surveys
General Household Survey
Labour Force Survey
Quarterly Labour Force Survey
Family Expenditure Survey
Family Resources Survey
Farm Business Survey
National Child Development Study
British Household Panel Study
Health Survey for England

Macro-economic time series databanks
ONS Time Series Databank
OECD Main Economic Indicators
UNIDO Industrial Statistics 3 digit level of ISIC code
UNIDO Industrial Statistics 4 digit level of ISIC code
UNIDO Commodity Balance Statistics Database
IMF International Financial Statistics
IMF Direction of Trade Statistics
IMF Balance of Payments Statistics
IMF Government Finance Statistics Yearbook
Note, however, that only registered users can obtain access to any of these data
sets.
Many of the data sets described above are free. Furthermore, most university
libraries or computer centers subscribe to various databases which the student
can use. You are advised to check with your own university library or computer
center to see what data sets you have access to. In the field of finance, there are
many excellent databases of stock prices and accounting information for all sorts
of companies for many years. Unfortunately, these tend to be very expensive and,
hence, you should see whether your university has a subscription to a financial
database. Two of the more popular ones are Datastream by Thomson
Financial (http://www.datastream.com/) and Wharton Research Data Services
(http://wrds.wharton.upenn.edu/). With regard to free data, a more limited
choice of financial data is available through popular Internet portals such as Yahoo!
(http://finance.yahoo.com). The Federal Reserve Bank of St Louis also maintains
a free database with a wide variety of data, including some financial time series
(http://research.stlouisfed.org/fred2/).
Many specialist fields also have freely available data on the web. For instance, in
the field of sports economics and statistics there are many excellent data sets available
free or for a nominal charge. For baseball, http://www.baseball1.com is a very
comprehensive data set. The Statistics in Sports Section of the American Statistical
Association also has a very useful website containing links to data sets for many sports
(http://www.amstat.org/sections/sis/sports.html). The general advice I want
to give here is that spending some time searching the Internet can often be very
fruitful.

Working with data: graphical methods
Once you have your data, it is important for you to summarize it. After all, anybody
who reads your work will not be interested in the dozens or – more likely – hundreds
or more observations contained in the original raw data set. Indeed, you can think of
the whole field of econometrics as one devoted to the development and dissemina-
tion of methods whereby information in data sets is summarized in informative ways.
Charts and tables are very useful ways of presenting your data. There are many dif-
ferent types (e.g. bar chart, pie chart, etc.). A useful way to learn about the charts is
to experiment with the ChartWizard© in Excel. In this section, we will illustrate a few
of the commonly used types of charts.
Since most economic data is either in time series or cross-sectional form, we will
briefly introduce simple techniques for graphing both types of data.
Time series graphs
Monthly time series data from January 1947 through October 1996 on the UK
pound/US dollar exchange rate is plotted using the “Line Chart” option in Excel’s
ChartWizard© in Figure 2.1 (this data is located in Excel file EXRUK.XLS). Such charts
are commonly referred to as time series graphs. The data set contains 598 obser-
vations – far too many to be presented as raw numbers for a reader to comprehend.
However, a reader can easily capture the main features of the data by looking at the
chart. One can see, for instance, the attempt by the UK government to hold the
exchange rate fixed until the end of 1971 (apart from large devaluations in Septem-
ber 1949 and November 1967) and the gradual depreciation of the pound as it floated
downward through the middle of the 1970s.
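The figure of 598 observations follows directly from the sample period. As a quick check, the sketch below counts the months from January 1947 through October 1996 inclusive; the helper `months_inclusive` is our own and is not part of any data file.

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Number of monthly observations from the start month through the
    end month, counting both end points."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


print(months_inclusive(1947, 1, 1996, 10))  # 598
```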
[Figure 2.1: line chart; horizontal axis: Date (Jan-47 to Jan-95); vertical axis: Pence per dollar (0 to 450)]
Fig. 2.1 Time series graph of UK pound/US dollar exchange rate.

Exercise 2.1
(a) Recreate Figure 2.1.
(b) File INCOME.XLS contains data on the natural logarithm of personal income
and consumption in the US from 1954Q1 to 1994Q2. Make one time series
graph that contains both of these variables. (Note that 1954Q1 means the
first quarter (i.e. January, February and March) of 1954.)
(c) Transform the logged personal income data to growth rates. Note that the
percentage change in personal income between period t − 1 and t is approximately
100 × [ln(Y_t) − ln(Y_{t−1})] and the data provided in INCOME.XLS is
already logged. Make a time series graph of the series you have created.
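The quality of the log-difference approximation mentioned in part (c) can be seen in a short Python sketch. The income levels below are hypothetical and are not taken from INCOME.XLS; they are chosen so that the exact growth rates are 1% and 2%.

```python
import math

levels = [1000.0, 1010.0, 1030.2]     # hypothetical income levels
logs = [math.log(y) for y in levels]  # what a file of logged data would contain

# Exact percentage change, computed from the levels.
exact = [100.0 * (b - a) / a for a, b in zip(levels, levels[1:])]

# Approximation: 100 times the difference of the logged series.
approx = [100.0 * (b - a) for a, b in zip(logs, logs[1:])]

for e, a in zip(exact, approx):
    print(round(e, 3), round(a, 3))
# 1.0 0.995
# 2.0 1.98
```

The approximation is close when growth rates are small, and it becomes less accurate as the percentage changes get larger.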
Histograms
With time series data, a chart that shows how a variable evolves over time is often
very informative. However, in the case of cross-sectional data, such methods are not
appropriate and we must summarize the data in other ways.
Excel file GDPPC.XLS contains cross-sectional data on real GDP per capita in 1992
for 90 countries from the PWT. Real GDP per capita in every country has been con-
verted into US dollars using purchasing power parity exchange rates. This allows us
to make direct comparisons across countries.
One convenient way of summarizing this data is through a histogram. To con-
struct a histogram, begin by constructing class intervals or bins that divide the coun-
tries into groups based on their GDP per capita. In our data set, GDP per person
varies from $408 in Chad to $17,945 in the US. One possible set of class intervals
is 0–2,000, 2,001–4,000, 4,001–6,000, 6,001–8,000, 8,001–10,000, 10,001–12,000,
12,001–14,000, 14,001–16,000 and 16,001 and over (where all figures are in US
dollars).
Note that each class interval (with the exception of the 16,001+ category) is $2,000
wide. In other words, the class width for each of our bins is 2,000. For each class
interval we can count up the number of countries that have GDP per capita in that
interval. For instance, there are seven countries in our data set with real GDP per
capita between $4,001 and $6,000. The number of countries lying in one class inter-
val is referred to as the frequency3
of that interval. A histogram is a bar chart that
plots frequencies against class intervals.4
Figure 2.2 is a histogram of our cross-country GDP per capita data set that uses
the class intervals specified in the previous paragraph. Note that, if you do not wish
to specify class intervals, Excel will do it automatically for you. Excel also creates a
frequency table, which is located above the histogram.
The frequency table indicates the number of countries belonging to each class
interval (or bin). The numbers in the column labeled “Bin” indicate the upper bounds

of the class intervals. For instance, we can read that there are 33 countries with GDP
per capita less than $2,000; 22 countries with GDP per capita above $2,000 but less
than $4,000; and so on. The last row says that there are four countries with GDP per
capita above $16,000.
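The construction of a frequency table can be sketched in Python as follows. We assume, as a spreadsheet histogram tool typically does, that each bin collects the values above the previous upper bound and up to and including its own. Only the values 408 and 17,945 come from the text; the other GDP figures are hypothetical and the function name `frequency_table` is our own.

```python
def frequency_table(values, upper_bounds):
    """Count the values falling in each bin. A bin holds values above the
    previous upper bound and up to (and including) its own upper bound.
    The final count is the 'More' category."""
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        for k, bound in enumerate(upper_bounds):
            if v <= bound:
                counts[k] += 1
                break
        else:
            counts[-1] += 1  # larger than every upper bound
    return counts


bins = [2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000]
# Only 408 and 17,945 appear in the text; the rest are made up for illustration.
gdp_per_capita = [408, 950, 1800, 2500, 3900, 5200, 9100, 13500, 15200, 17945]

counts = frequency_table(gdp_per_capita, bins)
for label, count in zip(bins + ["More"], counts):
    print(label, count)
```

Plotting these counts as bars against the bin labels gives the histogram.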
This same information is graphed in a simple fashion in the histogram. Graphing
allows for a quick visual summary of the cross-country distribution of GDP per
capita. We can see from the histogram that many countries are very poor, but that
there is also a “clump” of countries that are quite rich (e.g. 19 countries have GDP
per capita greater than $12,000). There are relatively few countries in between these
poor and rich groups (i.e. few countries fall in the bins labeled 8,000, 10,000 and
12,000).
Growth economists often refer to this clumping of countries into poor and rich
groups as the “twin peaks” phenomenon. In other words, if we imagine that the his-
togram is a mountain range, we can see a peak at the bin labeled 2,000 and a smaller
peak at 14,000. These features of the data can be seen easily from the histogram, but
would be difficult to comprehend simply by looking at the raw data.
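The statements above can be verified from the frequency table in Figure 2.2. The sketch below assumes the frequencies are read in bin order as 33, 22, 7, 3, 4, 2, 9, 6 and 4; the first three and the last are quoted in the text, while the assignment of the middle values to bins is our reading of the figure.

```python
bins = ["2,000", "4,000", "6,000", "8,000", "10,000",
        "12,000", "14,000", "16,000", "More"]
# Frequencies as read from the Figure 2.2 table (assumed ordering, see above).
freq = [33, 22, 7, 3, 4, 2, 9, 6, 4]

total = sum(freq)        # all countries in the data set
rich = sum(freq[6:])     # bins 14,000, 16,000 and More: above $12,000
middle = sum(freq[3:6])  # bins 8,000, 10,000 and 12,000

print(total)   # 90
print(rich)    # 19
print(middle)  # 9
```

The counts reproduce the 90 countries in the data set and the 19 countries with GDP per capita above $12,000, with only 9 countries in the three middle bins.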
Exercise 2.2
(a) Recreate the histogram in Figure 2.2.
(b) Create histograms using different class intervals. For instance, begin by
letting your software package choose default values and see what you get,
then try values of your own.
(c) If you are using Excel, redo questions (a) and (b) with the “Cumulative Per-
centage” box clicked on. What does this do?
[Figure 2.2: bar chart; horizontal axis: Bin; vertical axis: Frequency (0 to 35)]

Bin       Frequency
2,000     33
4,000     22
6,000     7
8,000     3
10,000    4
12,000    2
14,000    9
16,000    6
More      4

Fig. 2.2 Histogram.

XY-plots
Economists are often interested in the nature of the relationships between two or
more variables. For instance: “Are higher education levels and work experience asso-
ciated with higher wages among workers in a given industry?” “Are changes in the
money supply a reliable indicator of inflation changes?” “Do differences in capital
investment explain why some countries are growing faster than others?”
The techniques described previously are suitable for describing the behavior of
only one variable; for instance, the properties of real GDP per capita across coun-
tries in Figure 2.2. They are not, however, suitable for examining relationships
between pairs of variables.
Once we are interested in understanding the nature of the relationships between
two or more variables, it becomes harder to use graphs. Future chapters will discuss
regression analysis, which is the prime tool used by applied economists working with
many variables. However, graphical methods can be used to draw out some simple
aspects of the relationship between two variables. XY-plots (also called scatter dia-
grams) are particularly useful in this regard.
Figure 2.3 is a graph of data on deforestation (i.e. the average annual forest loss over
the period 1981–90 expressed as a percentage of total forested area) for 70 tropical
countries, along with data on population density (i.e. number of people per thousand
hectares). (This data is available in Excel file FOREST.XLS.) It is commonly thought that
countries with a high population density will likely deforest more quickly than those
with low population densities, since high population density may increase the pressure
to cut down forests for fuel wood or for agricultural land required to grow more food.
[Figure 2.3: scatter plot; horizontal axis: Population per 1,000 hectares (0 to 3,000); vertical axis: Average annual forest loss (%) (0 to 6); one point labeled Nicaragua]
Fig. 2.3 XY-plot of population density against deforestation.
Figure 2.3 is an XY-plot of these two variables. Each point on the chart represents
a particular country. Reading up the Y-axis (i.e. the vertical one) gives us the rate of
deforestation in that country. Reading across the X-axis (i.e. the horizontal one) gives
us population density. It is certainly possible to label each point with its correspond-
ing country name. We have not done so here, since labels for 70 countries would
clutter the chart and make it difficult to read. However, one country, Nicaragua, has
been labeled. Note that this country has a deforestation rate of 2.6% per year (Y =
2.6) and a population density of 640 people per thousand hectares (X = 640).
The XY-plot can be used to give a quick visual impression of the relationship
between deforestation and population density. An examination of this chart indicates
some support for the idea that a relationship between deforestation and population
density does exist. For instance, if we look at countries with a low population density
(less than 500 people per thousand hectares, say), almost all of them have very low
deforestation rates (less than 1% per year). If we look at countries with high population
densities (e.g. over 1,500 people per thousand hectares), almost all of them have high
deforestation rates (more than 2% per year). This indicates that there may be a
positive relationship between population density and deforestation (i.e. high values
of one variable tend to be associated with high values of the other; and low values,
associated with low values). It is also possible to have a negative relationship. This
would occur, for instance, if we substituted urbanization for population density in an
XY-plot. In this case, high levels of urbanization might be associated with low levels
of deforestation since expansion of cities would possibly reduce population pressures
in rural areas where forests are located.
It is worth noting that the positive or negative relationships found in the data are
only “tendencies”, and as such, do not hold necessarily for every country. That is,
there may be exceptions to the general pattern of high population density’s associa-
tion with high rates of deforestation. For example, on the XY-plot we can observe
one country with a high population density of roughly 1,300 and a low deforestation
rate of 0.7%. Similarly, low population density can also be associated with high rates
of deforestation, as evidenced by one country with a low population density of
roughly 150 but a high deforestation rate of almost 2.5% per year! As economists,
we are usually interested in drawing out general patterns or tendencies in the data.
However, we should always keep in mind that exceptions (in statistical jargon, outliers)
to these patterns typically exist. In some cases, finding out which countries don’t
fit the general pattern can be as interesting as the pattern itself.
Exercise 2.3
The file FOREST.XLS contains data on both the percentage increase in cropland
(the column labeled “Crop ch”) from 1980 to 1990 and on the percentage
increase in permanent pasture (the column labeled “Pasture ch”) over the same
period. Construct and interpret XY-plots of these two variables (one at a time)
against deforestation. Does there seem to be a positive relationship between
deforestation and expansion of pasture land? How about between deforestation
and the expansion of cropland?

Working with data: descriptive statistics
Graphs have an immediate visual impact that is useful for livening up an essay or
report. However, in many cases it is important to be numerically precise. Later
chapters will describe common numerical methods for summarizing the relationship
between several variables in greater detail. Here we discuss brieﬂy a few descriptive
statistics for summarizing the properties of a single variable. By way of motivation,
we will return to the concept of distribution introduced in our discussion on
histograms.
In our cross-country data set, real GDP per capita varies across the 90 countries.
This variability can be seen by looking at the histogram in Figure 2.2, which plots the
distribution of GDP per capita across countries. Suppose you wanted to summarize
the information contained in the histogram numerically. One thing you could do is
to present the numbers in the frequency table in Figure 2.2. However, even this table
may provide too many numbers to be easily interpretable. Instead it is common to
present two simple numbers called the mean and standard deviation.
The mean is the statistical term for the average. The mathematical formula for the
mean is given by:
\[ \bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} \]
where N is the sample size (i.e. number of countries) and Σ is the summation
operator (i.e. it adds up real GDP per capita for all countries). In our case, mean GDP
per capita is $5,443.80. Throughout this book, we will place a bar over a
variable to indicate its mean (i.e. Ȳ is the mean of the variable Y, X̄ is the mean of
X, etc.).
The concept of the mean is associated with the middle of a distribution. For
example, if we look at the previous histogram, $5,443.80 lies somewhere in the middle
of the distribution. The cross-country distribution of real GDP per capita is quite
unusual, having the twin peaks property described earlier. It is more common for
distributions of economic variables to have a single peak and to be bell-shaped.
Figure 2.4 is a histogram that plots just such a bell-shaped distribution. For such
distributions, the mean is located precisely in the middle of the distribution, under the
single peak.
Of course, the mean or average figure hides a great deal of variability across countries.
Other useful summary statistics, which shed light on the cross-country variation
in GDP per capita, are the minimum and maximum. For our data set, minimum GDP
per capita is $408 (Chad) and maximum GDP per capita is $17,945 (US). By looking at
the distance between the maximum and minimum we can see how dispersed the
distribution is.
The concept of dispersion is quite important in economics and is closely related
to the concepts of variability and inequality. For instance, real GDP per capita in 1992
in our data set varies from $408 to $17,945. If in the near future poorer countries
were to grow quickly, and richer countries to stagnate, then the dispersion of real
GDP per capita in, say, 2012, might be significantly less. It may be the case that the
poorest country at this time will have real GDP per capita of $10,000 while the richest
country will remain at $17,945. If this were to happen, then the cross-country distri-
bution of real GDP per capita would be more equal (less dispersed, less variable).
Intuitively, the notions of dispersion, variability and inequality are closely related.
The minimum and maximum, however, can be unreliable guidelines to dispersion.
For instance, what if, with the exception of Chad, all the poor countries experienced
rapid economic growth between 1992 and 2012, while the richer countries did not
grow at all? In this case, cross-country dispersion or inequality would decrease over
time. However, since Chad and the US did not grow, the minimum and maximum
would remain at $408 and $17,945, respectively.
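The point about the minimum and maximum can be checked with a few lines of code. The sketch below uses Python's standard `statistics` module on a small, made-up set of GDP per capita figures; the numbers and variable names are hypothetical and are not taken from the book's data set. It mimics the scenario just described, in which the middle countries grow while the poorest and richest do not.

```python
from statistics import mean

# Hypothetical real GDP per capita figures (US$) for five countries.
gdp_1992 = [408, 1200, 3000, 9000, 17945]
# Assumed scenario: the three middle countries grow, the poorest and richest do not.
gdp_2012 = [408, 8000, 10000, 12000, 17945]

def summarize(values):
    """Return the mean, minimum, maximum and range (max - min) of a list."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

summary_1992 = summarize(gdp_1992)
summary_2012 = summarize(gdp_2012)

print(summary_1992)
print(summary_2012)
```

In this made-up example the mean rises, yet the minimum, the maximum and the distance between them are identical in both years, so they give no hint that the countries have become more alike.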
A more common measure of dispersion is the standard deviation. (Confusingly,
statisticians refer to the square of the standard deviation as the variance.) Its
mathematical formula is given by:
\[ s = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}} \]
although in practice you will probably never have to calculate it by hand. You can
calculate it easily in Excel using either the Tools/Descriptive statistics or the Functions
facility. In some textbooks, a slightly different formula for calculating the standard
deviation is given where the N - 1 in the denominator is replaced by N.
Fig. 2.4 Histogram. (X-axis: bin; Y-axis: frequency.)
This measure has little direct intuition. In our cross-country GDP data set, the
standard deviation is $5,369.496 and it is difficult to get a direct feel for what this
number means in an absolute sense. However, the standard deviation can be interpreted
in a comparative sense. That is, if you compare the standard deviations of
two different distributions, the one with the smaller standard deviation will always
exhibit less dispersion. In our example, if the poorer countries were to suddenly
experience economic growth and the richer countries to stagnate, the standard deviation
would decrease over time.
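The comparative reading of the standard deviation can be illustrated with a short sketch. It again uses Python's standard `statistics` module and made-up GDP per capita figures (hypothetical values, not the book's data). `stdev` divides by N - 1, as in the formula above, while `pstdev` divides by N, the variant found in some textbooks.

```python
from statistics import pstdev, stdev

# Hypothetical real GDP per capita figures (US$), mirroring the scenario in
# the text: the middle countries grow, the poorest and richest do not.
gdp_1992 = [408, 1200, 3000, 9000, 17945]
gdp_2012 = [408, 8000, 10000, 12000, 17945]

# stdev() divides by N - 1; pstdev() divides by N.
s_1992 = stdev(gdp_1992)
s_2012 = stdev(gdp_2012)

print(f"1992: s = {s_1992:.1f} (N - 1), {pstdev(gdp_1992):.1f} (N)")
print(f"2012: s = {s_2012:.1f} (N - 1), {pstdev(gdp_2012):.1f} (N)")

# The range is identical in both years, but the standard deviation falls.
same_range = (max(gdp_1992) - min(gdp_1992)) == (max(gdp_2012) - min(gdp_2012))
print(same_range, s_2012 < s_1992)
```

With these assumed numbers the minimum and maximum are unchanged, but the standard deviation is smaller in the second year, which is the sense in which it picks up the reduced dispersion.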
Exercise 2.4
Construct and interpret descriptive statistics for the pasture change and crop-
land change variables in FOREST.XLS.
Chapter summary
1. Economic data come in many forms. Common types are time series, cross-
sectional and panel data.
2. Economic data can be obtained from many sources. The Internet is becom-
ing an increasingly valuable repository for many data sets.
3. Simple graphical techniques, including histograms and XY-plots, are useful
ways of summarizing the information in a data set.
4. Many numerical summaries can be used. The most important are the mean,
a measure of the location of a distribution, and the standard deviation, a
measure of how spread out or dispersed a distribution is.
Appendix 2.1: Index numbers
To illustrate the basic ideas in constructing a price index, we use the data shown in
Table 2.1.1 on the price of various fruits in various years.
Calculating a banana price index
We begin by calculating a price index for a single fruit, bananas, before proceeding to
the calculation of a fruit price index. As described in the text, calculating a price index
involves first selecting a base year. For our banana price index, let us choose the year
2000 as the base year (although it should be stressed that any year can be chosen).
By deﬁnition, the value of the banana price index is 100 in this base year. How did
we transform the price of bananas in the year 2000 to obtain the price index value
of 100? It can be seen that this transformation involved taking the price of bananas
in 2000 and dividing by the price of bananas in 2000 (i.e. dividing the price by itself)
and multiplying by 100. To maintain comparability, this same transformation must be
applied to the price of bananas in every year. The result is a price index for bananas
(with the year 2000 as the base year). This is illustrated in Table 2.1.2.
From the banana price index, it can be seen that between 2000 and 2003 the price
of bananas increased by 4.4% and in 1999 the price of bananas was 97.8% as high
as in 2000.
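The transformation described above (divide each year's price by the base year price and multiply by 100) can be sketched in a few lines of Python. The prices are those of Table 2.1.1; the function and variable names are illustrative only.

```python
# Banana prices (pounds per kg) from Table 2.1.1.
banana_price = {1999: 0.89, 2000: 0.91, 2001: 0.91, 2002: 0.94, 2003: 0.95}

def price_index(prices, base_year):
    """Divide every price by the base-year price and multiply by 100."""
    base = prices[base_year]
    return {year: 100 * (price / base) for year, price in prices.items()}

banana_index = price_index(banana_price, base_year=2000)

for year, value in banana_index.items():
    print(year, round(value, 1))
```

Passing a different `base_year` rebases the index: the chosen year takes the value 100 and every other year is expressed relative to it.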
Calculating a fruit price index
Table 2.1.1 Prices of different fruits in different years (£/kg).
Year    Bananas    Apples    Kiwi fruit
1999    0.89       0.44      1.58
2000    0.91       0.43      1.66
2001    0.91       0.46      1.90
2002    0.94       0.50      2.10
2003    0.95       0.51      2.25
Table 2.1.2 Calculating a banana price index.
Year    Price of bananas    Transformation    Price index
1999    0.89                × 100 ÷ 0.91      97.8
2000    0.91                × 100 ÷ 0.91      100
2001    0.91                × 100 ÷ 0.91      100
2002    0.94                × 100 ÷ 0.91      103.3
2003    0.95                × 100 ÷ 0.91      104.4
When calculating the banana price index (a single good), all we had to look at were
the prices of bananas. However, if we want to calculate a fruit price index (involving
several goods), then we have to combine the prices of all fruits together somehow.
One thing you could do is simply average the prices of all fruits together in each year
(and then construct a price index in the same manner as for the banana price index).
However, this strategy is usually inappropriate since it implicitly weights all goods
equally to one another (i.e. a simple average would just add up the prices of the three
fruits and divide by three). In our example (and most real-world applications), this
equal weighting is unreasonable. (An exception to this is the Dow Jones Industrial
Average which does equally weight the stock prices of all companies included in
making the index.) An examination of Table 2.1.1 reveals that the prices of bananas
and apples are going up only slightly over time (and in some years their prices are not
changing or are even dropping). However, the price of kiwi fruit is going up rapidly
over time. Bananas and apples are common fruits purchased frequently by many
people, whereas kiwi fruit are an obscure exotic fruit purchased infrequently by a tiny
minority of people. In light of this, it is unreasonable to weight all three fruits equally
when calculating a price index. A fruit price index which was based on a simple
average would reveal that the fruit prices were growing at a fairly rapid rate (i.e.
combining the slow growth of banana and apple prices with the very fast growth of kiwi
fruit prices would yield a fruit price index which indicates moderately fast growth).
However, if the government were to use such a price index to report “fruit prices are
increasing at a fairly rapid rate”, the vast majority of people would find this report
inconsistent with their own experience. That is, the vast majority of people buy only
bananas and apples and the prices of these fruits are growing only slowly over time.
The line of reasoning in the previous paragraph suggests that a price index which
weights all goods equally will not be a sensible one. It also suggests how one might
construct a sensible fruit price index: use a weighted average of the prices of all fruits
to construct an index where the weights are chosen so as to reflect the importance
of each good. In our fruit price index, we would want to attach more weight to
bananas and apples (the common fruits) and little weight to the exotic kiwi fruit.5
There are many different ways of choosing such weights. Here I shall describe two
common choices based on the idea that the weights should reflect the amount of each
fruit that is purchased. Of course, the amount of each fruit purchased can vary over
time, and it is in how they handle this issue that our two price indices differ.
The Laspeyres price index (using base year weights)
The Laspeyres price index uses the amount of each fruit purchased in the base year
(2000 in our example) to construct weights. In words, to construct the Laspeyres price
index, you calculate the average price of fruit in each year using a weighted average
where the weights are proportional to the amount of each fruit purchased in
2000. You then use this average fruit price to construct an index in the same manner
as we did for the banana price index (see Table 2.1.2).
Intuitively, if the average consumer buys 100 times as many kilograms of bananas as of
kiwi fruit in 2000, then banana prices will receive 100 times as much weight as kiwi fruit
prices in the Laspeyres price index. The Laspeyres price index can be written in terms
of a mathematical formula. Let P denote the price of a good, Q denote the quantity
of the good purchased and subscripts denote the good and year with bananas being
good 1, apples good 2 and kiwi fruit good 3. Thus, for instance, \(P_{1,2000}\) is the price
of bananas in the year 2000, \(Q_{3,2002}\) is the quantity of kiwi fruit purchased in 2002,
etc. See Appendix 1.1 if you are having trouble understanding this subscripting
notation or the summation operator used below.
With this notational convention established, the Laspeyres price index (LPI) in year
t (for t = 1999, 2000, 2001, 2002 and 2003) can be written as:
\[ LPI_t = \frac{\sum_{i=1}^{3} P_{it} Q_{i,2000}}{\sum_{i=1}^{3} P_{i,2000} Q_{i,2000}} \times 100. \]
Note that the numerator of this formula takes the price of each fruit and multi-
plies it by the quantity of that fruit purchased in the year 2000. This ensures that
bananas and apples receive much more weight in the Laspeyres price index. We will
not explain the denominator other than to note that it is necessary to ensure that
the Laspeyres price index is a valid index with a value of 100 in the base year. For
the more mathematically inclined, the denominator ensures that the weights in the
weighted average sum to one (which is necessary to ensure that it is a proper weighted
average).
Note also that the deﬁnition of the Laspeyres price index above has been written
for our fruit example involving three goods with a base year of 2000. In general, the
formula above can be extended to allow for any number of goods and any base year
by changing the 3 and 2000 as appropriate.
The calculation of the Laspeyres price index requires us to know the quantities
purchased of each fruit. Table 2.1.3 presents these quantities, and Table 2.1.4 shows the
resulting calculation of the index.
The Laspeyres fruit price index can be interpreted in the same way as the banana
price index. For instance, we can say that, between 2000 and 2003, fruit prices rose
by 8.7%.
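As a check on the arithmetic, the Laspeyres calculation can be sketched in Python using the prices of Table 2.1.1 and the base year (2000) quantities of Table 2.1.3. The function and variable names are illustrative only.

```python
# Prices (pounds per kg) from Table 2.1.1, in the order bananas, apples, kiwi fruit.
prices = {
    1999: (0.89, 0.44, 1.58),
    2000: (0.91, 0.43, 1.66),
    2001: (0.91, 0.46, 1.90),
    2002: (0.94, 0.50, 2.10),
    2003: (0.95, 0.51, 2.25),
}
BASE_YEAR = 2000
# Quantities purchased in the base year (thousands of kg), from Table 2.1.3.
base_quantities = (100, 82, 1)

def basket_cost(price_row, quantity_row):
    """Sum of price times quantity over the three fruits."""
    return sum(p * q for p, q in zip(price_row, quantity_row))

def laspeyres(year):
    """Cost of the base-year basket at year-t prices, relative to base-year prices."""
    numerator = basket_cost(prices[year], base_quantities)
    denominator = basket_cost(prices[BASE_YEAR], base_quantities)
    return 100 * numerator / denominator

lpi = {year: round(laspeyres(year), 1) for year in prices}
print(lpi)
```

Because the quantities are held fixed at their 2000 values, any movement in the index comes from price changes alone.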
The Paasche price index (using current year weights)
Table 2.1.3 Quantities purchased of fruits (thousands of kg).
Year    Bananas    Apples    Kiwi fruit
1999    100        78        1
2000    100        82        1
2001    98         86        3
2002    94         87        4
2003    96         88        5
The Laspeyres price index used base year weights to construct an average fruit price
from the prices of the three types of fruit. However, it is possible that the base year
weights (in our example, the base year was 2000) may be inappropriate if fruit
consumption patterns are changing markedly over time. In our example, bananas and
apples are the predominant fruits and, in all years, there are few kiwi fruit eaters. Our
Laspeyres price index (sensibly) weighted the prices of bananas and apples much
more heavily in the index than kiwi fruit. But what would have happened if, in 2001,
there had been a health scare indicating that eating apples was unhealthy and people
stopped eating apples and ate many more kiwi fruit instead? The Laspeyres price
index would keep on giving a low weight to kiwi fruit and a high weight to apples
even though people were now eating more kiwi fruit. The Paasche price index is an
index which attempts to surmount this problem by using current year purchases to
weight the individual fruits in the index.
In words, to construct the Paasche price index, you calculate the average price of
fruit in each year using a weighted average where the weights are proportional
to the amount of each fruit purchased in the current year. You then use this average
fruit price to construct an index in the same manner as we did for the banana price
index (see Table 2.1.2).
The mathematical formula for the Paasche price index (PPI) in year t (for t = 1999,
2000, 2001, 2002 and 2003) can be written as:
\[ PPI_t = \frac{\sum_{i=1}^{3} P_{it} Q_{it}}{\sum_{i=1}^{3} P_{i,2000} Q_{it}} \times 100. \]
Note that PPI is the same as LPI except that \(Q_{it}\) appears in the PPI formula where
\(Q_{i,2000}\) appeared in the LPI formula. Thus, the two indices are the same except for
the fact that PPI is using current year purchases instead of base year purchases.
Table 2.1.5 shows the calculation of the Paasche price index using the fruit price
data of Table 2.1.1 and the data on quantity of fruit purchased in Table 2.1.3.
Note that, since the Paasche price index does not weight the prices in the same
manner as the Laspeyres price index, we do not get exactly the same results in Tables
2.1.5 and 2.1.4. For instance, the Paasche price index says that, between 2000 and
2003, fruit prices rose by 10.4% (whereas the Laspeyres price index said 8.7%).
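The difference between the two indices comes down to which quantities are used as weights, as the following Python sketch illustrates. It uses the prices of Table 2.1.1 and the quantities of Table 2.1.3; the function and variable names are illustrative only.

```python
# Prices (pounds per kg) from Table 2.1.1 and quantities (thousands of kg)
# from Table 2.1.3, in the order bananas, apples, kiwi fruit.
prices = {
    1999: (0.89, 0.44, 1.58),
    2000: (0.91, 0.43, 1.66),
    2001: (0.91, 0.46, 1.90),
    2002: (0.94, 0.50, 2.10),
    2003: (0.95, 0.51, 2.25),
}
quantities = {
    1999: (100, 78, 1),
    2000: (100, 82, 1),
    2001: (98, 86, 3),
    2002: (94, 87, 4),
    2003: (96, 88, 5),
}
BASE_YEAR = 2000

def basket_cost(price_row, quantity_row):
    """Sum of price times quantity over the three fruits."""
    return sum(p * q for p, q in zip(price_row, quantity_row))

def laspeyres(year):
    """Weights are the base-year quantities."""
    q = quantities[BASE_YEAR]
    return 100 * basket_cost(prices[year], q) / basket_cost(prices[BASE_YEAR], q)

def paasche(year):
    """Weights are the current-year quantities."""
    q = quantities[year]
    return 100 * basket_cost(prices[year], q) / basket_cost(prices[BASE_YEAR], q)

for year in sorted(prices):
    print(year, round(laspeyres(year), 1), round(paasche(year), 1))
```

The only difference between the two functions is the line that selects `q`. The Paasche index ends up higher in 2003 because kiwi fruit, whose price rose fastest, has a larger quantity weight by then.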
Table 2.1.4 Calculating the Laspeyres fruit price index.
Year    Numerator = \(\sum_{i=1}^{3} P_{it} Q_{i,2000}\)    Denominator = \(\sum_{i=1}^{3} P_{i,2000} Q_{i,2000}\)    Laspeyres price index
1999    126.66    127.92    99.0
2000    127.92    127.92    100
2001    130.62    127.92    102.1
2002    137.10    127.92    107.2
2003    139.07    127.92    108.7
The Paasche and Laspeyres price indices are merely two out of myriad possibilities.
We will not discuss any of the other possibilities. However, it is important to
note that indices arise in many places in economics and finance. For instance, measures
of inflation reported in the newspapers are based on price indices. In the
economy, there are thousands of goods that people buy and price indices such as
the consumer price index (CPI) or retail price index (RPI) are weighted averages of
the prices of these thousands of goods. Information about stock markets is often
expressed in terms of stock price indices.
There is one other issue that sometimes complicates empirical studies, especially
involving macroeconomic data. Government statistical agencies often update the base
year they use in calculating their price index. So, when collecting data, you will some-
times face the situation where the first part of your data uses one base year and the
last part a different one. This problem is not hard to fix if you have one overlap year
where you know the value of the index in terms of both base years. Table 2.1.6
provides an illustration of how you can splice an index together when the base year
changes in this manner.
Table 2.1.5 Calculating the Paasche fruit price index.
Year    Numerator = \(\sum_{i=1}^{3} P_{it} Q_{it}\)    Denominator = \(\sum_{i=1}^{3} P_{i,2000} Q_{it}\)    Paasche price index
1999    124.90    126.20    99.0
2000    127.92    127.92    100
2001    134.44    131.14    102.5
2002    140.26    129.59    108.2
2003    147.33    133.50    110.4
Table 2.1.6 Splicing together an index when the base year changes.
Year    Old index with base year 1995    New index with base year 2001    Transformation to old index    Spliced index base year 2001
1995    100    -      × 95 ÷ 107    88.8
1996    102    -      × 95 ÷ 107    90.6
1997    103    -      × 95 ÷ 107    91.4
1998    103    -      × 95 ÷ 107    91.4
1999    105    -      × 95 ÷ 107    93.2
2000    107    95     -             95
2001    -      100    -             100
2002    -      101    -             101
2003    -      105    -             105
The statistical office has constructed a price index using 1995 as a base year, but
discontinued it after the year 2000. This is in the column labeled “Old index with
base year 1995”. In 2001, the statistical office started constructing the index using
2001 as the base year, but also went back and worked out the year 2000 value for this
index using the new base year. This new index is listed in the column labeled “New
index with base year 2001”. Note that we have one overlapping year, 2000. In order
to make the 2000 value for the old index and the new the same we have to take the
old index value and multiply it by 95 and divide it by 107. In order to be consistent
we must apply this same transformation to all values of the old index. The result of
transforming all values of the old index in this manner is given in the last column of
Table 2.1.6. This spliced index can now be used for empirical work as now the entire
index has the same base year of 2001.
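The splicing step amounts to one rescaling factor applied to every value of the old index. A minimal Python sketch, using the index values of Table 2.1.6 (the function and variable names are illustrative only):

```python
# Index values from Table 2.1.6.
old_index = {1995: 100, 1996: 102, 1997: 103, 1998: 103, 1999: 105, 2000: 107}  # base year 1995
new_index = {2000: 95, 2001: 100, 2002: 101, 2003: 105}                          # base year 2001

def splice(old, new, overlap_year):
    """Rescale the old index so that it equals the new index in the overlap year."""
    factor = new[overlap_year] / old[overlap_year]  # here 95 / 107
    spliced = {year: value * factor for year, value in old.items()}
    spliced.update(new)  # from the overlap year onward, keep the new index as published
    return spliced

spliced_index = splice(old_index, new_index, overlap_year=2000)

for year in sorted(spliced_index):
    print(year, round(spliced_index[year], 1))
```

The sketch assumes exactly one overlap year is supplied. If several overlap years were available, a choice would have to be made about which one (or what average) to use for the factor, which this example does not address.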
Appendix 2.2: Advanced descriptive statistics
The mean and standard deviation are the most common descriptive statistics but
many others exist. The mean is the simplest measure of location of a distribution.
The word “location” is meant to convey the idea of the center of the distribution.
The mean is the average. Other common measures of location are the mode and
median.
To distinguish between the mean, mode and median, consider a simple example.
Seven people report their respective incomes in £ per annum as: £18,000, £15,000,
£9,000, £15,000, £16,000, £17,000 and £20,000. The mean, or average, income of
these seven people is £15,714.
The mode is the most common value. In the present example, two people have
reported incomes of £15,000. No other income value is reported more than once.
Hence, £15,000 is the modal income for these seven people.
The median is the middle value. That is, it is the value that splits the distribution
into two equal halves. In our example, it is the income value at which half the people
have higher incomes and half the people have lower incomes. Here the median is
£16,000. Note that three people have incomes less than the median and three have
incomes higher than it.
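The three measures of location for the seven incomes above can be reproduced with Python's standard `statistics` module, shown here as a quick check of the arithmetic:

```python
from statistics import mean, median, mode

# The seven reported incomes (pounds per annum) from the example in the text.
incomes = [18000, 15000, 9000, 15000, 16000, 17000, 20000]

mean_income = mean(incomes)      # sum of incomes divided by seven
median_income = median(incomes)  # middle value once the incomes are sorted
modal_income = mode(incomes)     # most frequently reported value

print(round(mean_income), median_income, modal_income)  # 15714 16000 15000
```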
The mode and median can also be motivated through consideration of Figures 2.2
and 2.4, which plot two different histograms or distributions. A problem with the
mode is that there may not be a most common value. For instance, in the GDP per
capita data set (GDPPC.XLS), no two countries have precisely the same values. So there
is no value that occurs more than once. For cases like this, the mode is the highest
point of the histogram. A minor practical problem with deﬁning the mode in this
way is that it can be sensitive to the choice of class intervals (and this is why Excel
gives a slightly different answer for the mode for GDPPC.XLS than the one given here).
In Figure 2.2, the histogram is highest over the class interval labeled 2,000. Remem-
ber, Excel’s choice of labeling means that the class interval runs from 0 to 2,000.
Hence, we could say that “the class interval 0 to 2,000 is the modal (or most likely)
value”. Alternatively, it is common to report the middle value of the relevant
class interval as the mode. In this case, we could say, “the mode is $1,000”. The mode
is probably the least commonly used of the three measures of location introduced
here.

To understand the median, imagine that all the area of the histogram is shaded.
The median is the point on the X-axis which divides this shaded area precisely in half.
For Figure 2.4 the highest point (i.e. the mode) is also the middle point that divides
the distribution in half (i.e. the median). It turns out it is also the mean. However, in
Figure 2.2 the mean ($5,443.80), median ($3,071.50) and mode ($1,000) are quite
different.
Other useful summary statistics are based on the notion of a percentile. Consider
our GDP per capita data set. For any chosen country, say Belgium, you can ask “how
many countries are poorer than Belgium?” or, more precisely, “what proportion of
countries are poorer than Belgium?”. When we ask such questions we are, in effect,
asking what percentile Belgium is at. Formally, the Xth percentile is the data value
(e.g. a GDP per capita figure) such that X% of the observations (e.g. countries) have
lower data values. In the cross-country GDP data set, the 37th percentile is $2,092.
This is the GDP per capita figure for Peru. 37% of the countries in our data set are
poorer than Peru.
Several percentiles relate to concepts we have discussed before. The 50th percentile
is the median. The minimum and maximum are the 0th and 100th percentiles, respectively. The
percentile divides the data range up into hundredths, while other related concepts use
other basic units. Quartiles divide the data range up into quarters. Hence, the first
quartile is equivalent to the 25th percentile, the second quartile, the 50th percentile
(i.e. the median) and the third quartile, the 75th percentile. Deciles divide the data
up into tenths. In other words, the first decile is equivalent to the 10th percentile, the
second decile, the 20th percentile, etc.
After the standard deviation, the most common measure of dispersion is the inter-
quartile range. As its name suggests, it measures the difference between the third
and first quartiles. For the cross-country data set, 75% of countries have GDP per
capita less than $9,802 and 25% have GDP per capita less than $1,162. In other words,
$1,162 is the first quartile and $9,802 is the third quartile. The inter-quartile range is
$9,802 - $1,162 = $8,640.
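Percentiles, quartiles and the inter-quartile range can be sketched as follows. The data are hypothetical (nine made-up GDP per capita figures, not the book's data set), and the helper `percent_below` is an illustrative name that simply applies the definition given above. Note that software packages interpolate between observations in slightly different ways, so quartiles computed here need not match Excel's exactly; the `"inclusive"` method of `statistics.quantiles` is one such convention.

```python
from statistics import quantiles

# Hypothetical GDP per capita figures (US$) for nine countries, sorted.
gdp = [400, 1100, 1200, 2100, 3100, 5400, 9800, 12000, 17900]

def percent_below(data, value):
    """Percentage of observations with a value strictly lower than `value`."""
    return 100 * sum(x < value for x in data) / len(data)

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(gdp, n=4, method="inclusive")
iqr = q3 - q1  # inter-quartile range: third quartile minus first quartile

print(q1, q2, q3)
print(iqr)
print(round(percent_below(gdp, 2100), 1))  # share of countries poorer than the fourth one
```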
Endnotes
1. As emphasized in chapter 1, this is not a book about collecting data. Nevertheless, it is
useful to offer a few brief pointers about how to look for data sets.
2. As will be discussed in later chapters, it is sometimes convenient to take the natural
logarithm, or ln(.), of variables. The definition and properties of logarithms can be found in
virtually any introductory mathematics textbook. Using the properties of logarithms, it can
be shown that the percentage change in a variable is approximately \(100 \times [\ln(Y_t) - \ln(Y_{t-1})]\).
This formula is often used in practice and relates closely to ideas in nonstationary time series
(see Chapters 9 and 10).
3. Note that the use of the word “frequency” here as meaning “the number of observations
that lie in a class interval” is somewhat different from the use of the word “frequency” in
time series analysis (see the discussion of time series data earlier).

4. Excel creates the histogram using the Histogram command (in Tools/Data Analysis). It
simply plots the bins on the horizontal axis and the frequency (or number of observations
in a class) on the vertical axis. Note that most statistics books plot class intervals against
frequencies divided by class width. This latter strategy corrects for the fact that class widths
may vary across class intervals. In other words, Excel does not calculate the histogram
correctly. Provided the class intervals are the same width (or nearly so) this error is not of
great practical importance.
5. For the student of ﬁnance interested in following up our earlier discussion of the Dow
Jones Industrial Average it should be mentioned that the S&P500 is a price index which
weights stock prices by the size of the company.

C H A P T E R
Correlation
3
Often economists are interested in investigating the nature of the relationship
between different variables, such as the education level of workers and their wages
or interest rates and inflation. Correlation is an important way of numerically quan-
tifying the relationship between two variables. A related concept, introduced in
future chapters, is regression, which is essentially an extension of correlation to cases
of three or more variables that introduces an aspect of causality. As you will quickly
find as you read through this chapter and those that follow, it is no exaggeration to
say that correlation and regression are the most important unifying concepts of this
book.
In this chapter, we will first describe the theory behind correlation, and then work
through a few examples designed to think intuitively about the concept in different
ways.
Understanding correlation
Let X and Y be two variables (e.g. population density and deforestation, respectively)
and let us also suppose that we have data on i = 1, . . . , N different units (e.g.
countries). The correlation between X and Y is denoted by the small letter, r, and its
precise mathematical formula is given in Appendix 3.1. Of course, in practice, you
will never actually have to use this formula directly. Any spreadsheet or econometrics
software package will do it for you. In Excel, you can use the Tools/Data Analysis
or Function Wizard to calculate it. It is usually clear from the context to which
variables r refers. However, in some cases we will use subscripts to indicate that \(r_{XY}\)
is the correlation between variables X and Y, \(r_{XZ}\) the correlation between variables X
and Z, etc.
Once you have calculated the correlation between two variables you will obtain a
number (e.g. r = 0.55). It is important that you know how to interpret this number.
In this section, we will try to develop some intuition about correlation. First, however,
let us briefly list some of the numerical properties of correlation.
Properties of correlation
1. r always lies between -1 and 1, which may be written as −1 ≤ r ≤ 1.
2. Positive values of r indicate a positive correlation between X and Y. Nega-
tive values indicate a negative correlation. r = 0 indicates that X and Y are
uncorrelated.
3. Larger positive values of r indicate stronger positive correlation. r = 1 indicates
perfect positive correlation. Larger negative values1
of r indicate stronger negative
correlation. r = -1 indicates perfect negative correlation.
4. The correlation between Y and X is the same as the correlation between X and
Y.
5. The correlation between any variable and itself (e.g. the correlation between Y and
Y) is 1.
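The properties listed above can be checked numerically. The sketch below implements the standard (Pearson) correlation formula, which is assumed here to be the one referred to in Appendix 3.1, and applies it to a small set of made-up figures; the data and the function name `correlation` are hypothetical.

```python
from math import isclose, sqrt

def correlation(x, y):
    """Pearson correlation: co-movement of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    return cross / sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))

# Hypothetical data: population density (X) and deforestation rate (Y).
x = [100, 300, 640, 1200, 2000]
y = [0.3, 0.8, 2.6, 1.9, 3.5]

r = correlation(x, y)
print(-1 <= r <= 1)                                    # property 1
print(r > 0)                                           # property 2: positive for these data
print(isclose(correlation(x, [-v for v in x]), -1.0))  # property 3: perfect negative correlation
print(isclose(r, correlation(y, x)))                   # property 4: order does not matter
print(isclose(correlation(y, y), 1.0))                 # property 5: a variable with itself
```

The comparisons use `isclose` rather than exact equality because floating-point rounding can leave a result a hair away from 1 or -1.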
Understanding correlation through verbal reasoning
Statisticians use the word correlation in much the same way as the layperson does.
The following continuation of the deforestation/population density example from
Chapter 2 will serve to illustrate verbal ways of conceptualizing the concept of
correlation.
Example: The correlation between deforestation and
population density
Let us suppose that we are interested in investigating the relationship between
deforestation and population density. Remember that Excel file FOREST.XLS
contains data on these variables (and others) for a cross-section of 70 tropical
countries. Using Excel, we find that the correlation between deforestation (Y)
and population density (X ) is 0.66. Being greater than zero, this number allows
us to make statements of the following form:
1. There is a positive relationship (or positive association) between deforesta-
tion and population density.

2. Countries with high population densities tend to have high deforestation
rates. Countries with low population densities tend to have low deforesta-
tion rates. Note that we use the word “tend” here. A positive correlation
does not mean that every country with a high population density necessar-
ily has a high deforestation rate, but rather that this is the general tendency.
It is possible that a few individual countries do not follow this pattern (see
the discussion of outliers in Chapter 2).
3. Deforestation rates vary across countries as do population densities (the
reason we call them “variables”). Some countries have high deforestation
rates, others have low rates. This high/low cross-country variance in defor-
estation rates tends to “match up” with the high/low variance in population
densities.
All that the preceding statements require is for r to be positive. If r were nega-
tive the opposite of these statements would hold. For instance, high values of
X would be associated with low values of Y, etc. It is somewhat more difﬁcult
to get an intuitive feel for the exact number of the correlation (e.g. how is the
correlation 0.66 different from 0.26?). The XY-plots discussed below offer
some help, but here we will brieﬂy note an important point to which we shall
return when we discuss regression:
4. The degree to which deforestation rates vary across countries can be mea-
sured numerically using the formula for the standard deviation discussed in
Chapter 2. As mentioned in point 3 above, the fact that deforestation and
population density are positively correlated means that their patterns of
cross-country variability tend to match up. The correlation squared (r²) measures
the proportion of the cross-country variability in deforestation that
matches up with, or is explained by, the variance in population density. In
other words, correlation is a numerical measure of the degree to which patterns
in X and Y correspond. In our population/deforestation example,
since 0.66² = 0.44, we can say that 44% of the cross-country variance in
deforestation can be explained by the cross-country variance in population
density.
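For readers working outside Excel, the calculation behind these statements can be sketched in a few lines of Python. The six data values below are invented for illustration and are not taken from FOREST.XLS; the function name `correlation` is simply a label chosen here.

```python
import math


def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical values for six countries (NOT the FOREST.XLS data).
pop_density = [100, 250, 400, 600, 900, 1200]
deforestation = [0.5, 1.2, 0.8, 2.0, 1.5, 2.6]

r = correlation(pop_density, deforestation)
r_squared = r ** 2  # share of variability in Y that matches up with X

print(f"r = {r:.2f}, r squared = {r_squared:.2f}")

# The arithmetic quoted in the text: 0.66 squared, rounded to two decimals.
book_share = round(0.66 ** 2, 2)
print(f"0.66 squared = {book_share}")
```

Squaring the correlation always gives a number between 0 and 1, which is why it can be read as a proportion.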
Exercise 3.1
(a) Using the data in FOREST.XLS, calculate and interpret the mean, standard
deviation, minimum and maximum of deforestation and population density.
(b) Verify that the correlation between these two variables is 0.66.

Example: House prices in Windsor, Canada
The Excel file HPRICE.XLS contains data relating to N = 546 houses sold in
Windsor, Canada in the summer of 1987. It contains the selling price (in Cana-
dian dollars) along with many characteristics for each house. We will use this
data set extensively in future chapters, but for now let us focus on just a few
variables. In particular, let us assume that Y = the sales price of the house and
X = the size of its lot in square feet, lot size being the area occupied by the
house itself plus its garden or yard. The correlation between these two variables
is rXY = 0.54.
The following statements can be made about house prices in Windsor:
1. Houses with large lots tend to be worth more than those with small lots.
2. There is a positive relationship between lot size and sales price.
3. The variance in lot size accounts for 29% (i.e. 0.54² = 0.29) of the variability
in house prices.
Now let us add a third variable, Z = number of bedrooms. Calculating the
correlation between house prices and number of bedrooms, we obtain rYZ =
0.37. This result says, as we would expect, that houses with more bedrooms tend
to be worth more than houses with fewer bedrooms.
Similarly, we can calculate the correlation between number of bedrooms and
lot size. This correlation turns out to be rXZ = 0.15, and indicates that houses
with larger lots also tend to have more bedrooms. However, this correlation is
small and, perhaps unexpectedly, suggests that the link between lot
size and number of bedrooms is quite weak. In other words, you may have
expected that houses on larger lots, being bigger, would have more bedrooms
than houses on smaller lots. But the correlation indicates that there is only a
weak tendency for this to occur.
The above example allows us to motivate briefly an issue of importance in econo-
metrics, namely, that of causality. Indeed, economists are often interested in finding
out whether one variable “causes” another. We will not provide a formal definition
of causality here but instead will use the word in its everyday meaning. In this example,
it is sensible to interpret the positive correlation between house price and lot size as
reflecting a causal relationship. That is, lot size is a variable that directly influences (or causes)
house prices. However, house prices do not influence (or cause) lot size. In other
words, the direction of causality flows from lot size to house prices, not the other
way around.
Another way of thinking about these issues is to ask yourself what would happen
if a homeowner were to purchase some adjacent land, and thereby increase the lot
size of his/her house. This action would tend to increase the value of the house (i.e.
an increase in lot size would cause the price of the house to increase). However, if
you reflect on the opposite question: “will increasing the price of the house cause lot
size to increase?” you will see that the opposite causality does not hold (i.e. house
price increases do not cause lot size increases). For instance, if house prices in
Windsor were suddenly to rise for some reason (e.g. due to a boom in the economy)
this would not mean that houses in Windsor suddenly got bigger lots.
The discussion in the previous paragraph could be repeated with “lot size” replaced
by “number of bedrooms”. That is, it is reasonable to assume that the positive correlation
between Y = house prices and Z = number of bedrooms is due to Z’s influencing
(or causing) Y, rather than the opposite. Note, however, that it is difficult to
interpret the positive (but weak) correlation between X = lot size and Z = number of
bedrooms as reflecting causality. That is, there is a tendency for houses with many
bedrooms to occupy large lots, but this tendency does not imply that the former
causes the latter.
One of the most important skills in empirical work is knowing how to interpret
your results. The house example illustrates this point well. It is not enough just to
report a number for a correlation (e.g. rXY = 0.54). Interpretation is important too.
Interpretation requires a good intuitive knowledge of what a correlation is in addi-
tion to a lot of common sense about the economic phenomenon under study. Given
the importance of interpretation in empirical work, the following section will present
several examples to show why variables are correlated and how common sense can
guide us in interpreting them.
Exercise 3.2
(a) Using the data in HPRICE.XLS, calculate and interpret the mean, standard
deviation, minimum and maximum of Y = house price (labeled “sale price”
in HPRICE.XLS), X = lot size and Z = number of bedrooms (labeled
“#bedroom”).
(b) Verify that the correlation between X and Y is the same as given in the
example. Repeat for X and Z then for Y and Z.
(c) Now add a new variable, W = number of bathrooms (labeled “#bath”).
Calculate the mean of W.
(d) Calculate and interpret the correlation between W and Y. Discuss to what
extent it can be said that W causes Y.
(e) Repeat part (d) for W and X and then for W and Z.
Understanding why variables are correlated
In our deforestation/population density example, we discovered that deforestation
and population density are indeed correlated positively, indicating a positive relation-
ship between the two. But what exact form does this relationship take? As discussed
above, we often like to think in terms of causality or influence, and it may indeed be
the case that correlation and causality are closely related. For instance, the finding that
population density and deforestation are correlated could mean that the former
directly causes the latter. Similarly, the finding of a positive correlation between edu-
cation levels and wages could be interpreted as meaning that more education does
directly influence the wage one earns. However, as the following examples demonstrate,
the interpretation that correlation implies causality is not always an
accurate one.
Example: Correlation does not necessarily imply causality
It is widely accepted that cigarette smoking causes lung cancer. Let us assume
that we have collected data from many people on (a) the number of cigarettes
each person smokes per week (X) and (b) on whether they have ever had or
now have lung cancer (Y). Since smoking causes cancer we would undoubtedly
find rXY > 0; that is, that people who smoke tend to have higher rates of lung
cancer than non-smokers. Here the positive correlation between X and Y indicates
direct causality.
Now suppose that we also have data on the same people, measuring the
amount of alcohol they drink in a typical week. Let us call this variable Z. In
practice, it is the case that heavy drinkers also tend to smoke and, hence, rXZ >
0. This correlation does not mean that cigarette smoking also causes people to
drink. Rather it probably reflects some underlying social attitudes. It may reflect
the fact, in other words, that people who smoke do not worry about their nutri-
tion, or that their social lives revolve around the pub, where drinking and
smoking often go hand in hand. In either case, the positive correlation between
smoking and drinking probably reflects some underlying cause (e.g. social atti-
tude), which in turn causes both. Thus, a correlation between two variables does
not necessarily mean that one causes the other. It may be the case that an under-
lying third variable is responsible.
Now consider the correlation between lung cancer and heavy drinking. Since
people who smoke tend to get lung cancer more, and people who smoke also
tend to drink more, it is not unreasonable to expect that lung cancer rates will
be higher among heavy drinkers (i.e. rYZ > 0). Note that this positive correlation
does not imply that alcohol consumption causes lung cancer. Rather, it is ciga-
rette smoking that causes cancer, but smoking and drinking are related to some
underlying social attitude. This example serves to indicate the kind of compli-
cated patterns of causality which occur in practice, and how care must be taken
when trying to relate the concepts of correlation and causality.
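The pattern described in this example can be reproduced with a small simulation. The sketch below is purely illustrative: the variables are artificial scores drawn at random, the "social attitude" variable and the continuous "cancer risk" score are assumptions made here for convenience, and none of the numbers come from real health data.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hidden "social attitude" that drives both habits (assumed, unobserved).
attitude = [rng.gauss(0, 1) for _ in range(n)]
smoking = [a + rng.gauss(0, 1) for a in attitude]   # X
drinking = [a + rng.gauss(0, 1) for a in attitude]  # Z
# Cancer risk score Y is built from smoking only, never from drinking.
cancer = [s + rng.gauss(0, 1) for s in smoking]

r_xy = correlation(smoking, cancer)
r_xz = correlation(smoking, drinking)
r_yz = correlation(cancer, drinking)

print(f"smoking vs cancer:   {r_xy:.2f}")
print(f"smoking vs drinking: {r_xz:.2f}")
print(f"cancer vs drinking:  {r_yz:.2f}")
```

All three correlations come out positive even though, by construction, drinking plays no part in generating the cancer score. The correlation between Y and Z arises only because both are linked to smoking.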

Example: Direct versus indirect causality
Another important distinction is that between direct (or immediate) and indirect
(or mediated) causality. Recall that in our deforestation/population density
example, population density (X ) and deforestation (Y ) were found to be pos-
itively correlated (i.e. rXY > 0). One reason for this positive correlation is that
high population pressures in rural areas cause farmers to cut down forests to
clear new land in order to grow food. It is this latter on-going process of agri-
cultural expansion which directly causes deforestation. If we calculated the cor-
relation between deforestation and agricultural expansion (Z ), we would find
rYZ > 0. In this case population density would be an indirect cause, and agricul-
tural expansion, a direct cause of deforestation. In other words, we can say that
X (population pressures) causes Z (agricultural expansion), which in turn causes
Y (deforestation). Such a pattern of causality is consistent with rXY > 0 and rZY
> 0.
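A chain of this kind can also be mimicked with artificial data. In the sketch below the three series are random scores invented for illustration (they are not the FOREST.XLS variables), and the assumption that each link simply adds independent noise is a simplification made here.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000

# X causes Z, and Z in turn causes Y; X never enters Y directly.
pressure = [rng.gauss(0, 1) for _ in range(n)]           # X: population pressure
expansion = [p + rng.gauss(0, 1) for p in pressure]      # Z: agricultural expansion
deforestation = [z + rng.gauss(0, 1) for z in expansion] # Y: deforestation

r_xy = correlation(pressure, deforestation)
r_zy = correlation(expansion, deforestation)

print(f"indirect link, X vs Y: {r_xy:.2f}")
print(f"direct link,   Z vs Y: {r_zy:.2f}")
```

Both correlations are positive, as the text describes. In this particular set-up the direct link shows the larger correlation because extra noise accumulates along the chain, but that ordering is a feature of the assumptions used here rather than a general rule.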
In our house price example, however, it is likely that the positive correlations
we observed reflect direct causality. For instance, having a larger lot is considered
by most people to be a good thing in and of itself, so that increasing the
lot size should directly increase the value of a house. There is no other intervening
variable here, and hence we say that the causality is direct (see endnote 2).
The general message that should be taken from these examples is that correlations
can be very suggestive, but cannot on their own establish causality. In
the smoking/cancer example above, the finding of a positive correlation
between smoking and lung cancer, in conjunction with medical evidence on the
manner in which substances in cigarettes trigger changes in the human body,
has convinced most people that smoking causes cancer. In the house price
example, common sense tells us that the variable number of bedrooms directly
influences house prices. In economics, the concept of correlation can be used
in conjunction with common sense or a convincing economic theory to estab-
lish causality.
Exercise 3.3
People with university education tend to hold higher paying jobs than those with
fewer educational qualifications. This could be due to the fact that a university
education provides important skills that employers value highly. Alternatively, it
could be the case that smart people tend to go to university and that employers
want to hire these smart people (i.e. a university degree is of no interest in and
of itself to employers).

Suppose you have data on Y = income, X = number of years of schooling
and Z = the results of an intelligence test (see endnote 3) for many people, and that you have
calculated rXY, rXZ and rYZ. In practice, what signs would you expect these cor-
relations to have? Assuming the correlations do have the signs you expect, can
you tell which of the two stories in the paragraph above is correct?
Understanding correlation through XY-plots
Intuition about the meaning of correlations can also be obtained from the XY-plots
described in Chapter 2. Recall that in that chapter we discussed positive and negative
relationships based on whether the XY-plots exhibited a general upward or downward
slope (see endnote 4). If two variables are correlated, then an XY-plot of one against the
other will also exhibit such patterns. For instance, the XY-plot of population density against
deforestation exhibits an upward sloping pattern (see Figure 2.3). This plot implies that
these two variables should be positively correlated, and we find that this is indeed the
case from the correlation, r = 0.66. The important point here is that positive correlation
is associated with upward sloping patterns in the XY-plot and negative correlation
is associated with downward sloping patterns. All the intuition we developed about
XY-plots in the previous chapter can now be used to develop intuition about correlation.
Figure 3.1 uses the Windsor house price data set (HPRICE.XLS) to produce an XY-plot
of X = lot size against Y = house price. Recall that the correlation between
these two variables was calculated as rXY = 0.54, which is a positive number. This
positive (upward sloping) relationship between lot size and house price can clearly be
seen in Figure 3.1. That is, houses with small lots (i.e. small X-axis values) also tend
to have low prices (i.e. small Y-axis values). Conversely, houses with large lots tend
to have high prices.
Fig. 3.1 XY-plot of price versus lot size. (Horizontal axis: lot size in square feet; vertical axis: house price in Canadian dollars.)
Fig. 3.2 XY-plot of two perfectly correlated variables (r = 1).
Fig. 3.3 XY-plot of two positively correlated variables (r = 0.51).
The previous discussion relates mainly to the sign of the correlation. However,
XY-plots can also be used to develop intuition about how to interpret the magnitude
of a correlation, as the following examples illustrate.
Figure 3.2 is an XY-plot of two perfectly correlated variables (i.e. r = 1). Note that
they do not correspond to any actual economic data, but were simulated on the com-
puter. All the points lie exactly on a straight line.
Figure 3.3 is an XY-plot of two variables which are positively correlated (r = 0.51),
but not perfectly correlated. Note that the XY-plot still exhibits an upward sloping
pattern, but that the points are much more widely scattered.

Figure 3.4 is an XY-plot of two completely uncorrelated variables (r = 0). Note
that the points seem to be randomly scattered over the entire plot.
Plots for negative correlation exhibit downward sloping patterns, but otherwise the
same sorts of patterns noted above hold for them. For instance, Figure 3.5 is an XY-
plot of two variables that are negatively correlated (r = -0.58).
These figures illustrate one way of thinking about correlation: correlation indicates
how well a straight line can be fit through an XY-plot. Variables that are strongly cor-
related fit on or close to a straight line. Variables that are weakly correlated are more
scattered in an XY-plot.
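The straight-line interpretation can be checked numerically. The sketch below simulates pairs of variables with roughly the correlations shown in the figures and, for each pair, measures how much of the variation in Y a least-squares straight line captures. The data are artificial, the helper `line_fit_share` is a name made up here, and the least-squares line is used as a stand-in for the "best fitting line" that the book formalizes later under regression.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def line_fit_share(x, y):
    """Share of the variation in y captured by the least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    slope = cross / ss_x
    intercept = mean_y - slope * mean_x
    # Squared vertical distances of the points from the fitted line.
    residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1 - residual / ss_y


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]

results = {}
for target in (1.0, 0.51, 0.0, -0.58):
    # Mix x with independent noise so the underlying correlation equals `target`.
    weight = math.sqrt(1 - target ** 2)
    y = [target * a + weight * e for a, e in zip(x, noise)]
    results[target] = (correlation(x, y), line_fit_share(x, y))
    r, share = results[target]
    print(f"target {target:5.2f}: r = {r:5.2f}, share captured by line = {share:.2f}")
```

In every case the share of variation captured by the line equals the squared correlation, so a correlation near 1 or -1 means the points hug the line, while a correlation near 0 means the line captures almost nothing.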
Fig. 3.4 XY-plot of two uncorrelated variables (r = 0).
Exercise 3.4
The file EX34.XLS contains four variables: Y, X1, X2 and X3.
(a) Calculate the correlation between Y and X1. Repeat for Y and X2 and for Y
and X3.
(b) Create an XY-plot involving Y and X1. Repeat for Y and X2 and for Y and
X3.
(c) Interpret your results for (a) and (b).

Correlation between several variables
Correlation is a property that relates two variables together. Frequently, however,
economists must work with several variables. For instance, house prices depend on
the lot size, number of bedrooms, number of bathrooms and many other character-
istics of the house. As we shall see in subsequent chapters, regression is the most
appropriate tool for use if the analysis contains more than two variables. Yet it is also
not unusual for empirical researchers, when working with several variables, to calcu-
late the correlation between each pair. This calculation is laborious when the number
of variables is large. For instance, if we have three variables, X, Y and Z, then there
are three possible correlations (i.e. rXY, rXZ and rYZ). However, if we add a fourth vari-
able, W, the number increases to six (i.e. rXY, rXZ, rXW, rYZ, rYW and rZW). In general, for
M different variables there will be M × (M - 1)/2 possible correlations. A convenient
way of ordering all these correlations is to construct a matrix or table, as illustrated
by the following example.
CORMAT.XLS contains data on three variables labeled X, Y and Z. X is in the first
column, Y the second and Z the third. Using Excel, we can create a correlation matrix
(Table 3.1) for these variables.
The number 0.318237 is the correlation between the variable in the first column
(X) and that in the second column (Y). Similarly, -0.13097 is the correlation between
X and Z, and 0.096996, the correlation between Y and Z. Note that the 1s in the correlation
matrix indicate that any variable is perfectly correlated with itself.
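The same kind of table can be built outside Excel. The following sketch computes every pairwise correlation for three short series and prints the lower triangle in the layout of Table 3.1. The data are invented for illustration and are not the contents of CORMAT.XLS; `pair_count` is a helper name chosen here.

```python
import math
from itertools import combinations


def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def pair_count(m):
    """Number of distinct pairs among m variables: m * (m - 1) / 2."""
    return m * (m - 1) // 2


# Hypothetical data (NOT the contents of CORMAT.XLS).
data = {
    "X": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Y": [1.0, 3.0, 2.0, 5.0, 4.0],
    "Z": [9.0, 7.0, 8.0, 3.0, 6.0],
}

names = list(data)
matrix = {a: {b: correlation(data[a], data[b]) for b in names} for a in names}
pairs = list(combinations(names, 2))

print(f"{len(names)} variables give {len(pairs)} distinct correlations")

# Print only the lower triangle, since the matrix is symmetric.
for i, row in enumerate(names):
    cells = "  ".join(f"{matrix[row][col]:8.4f}" for col in names[: i + 1])
    print(f"{row}  {cells}")
```

Only the lower triangle is printed because the correlation between X and Y is the same as that between Y and X, which is also why Excel leaves the upper triangle blank.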
Fig. 3.5 XY-plot of two negatively correlated variables (r = -0.58).

Exercise 3.5
(a) Using the data in FOREST.XLS, calculate and interpret a correlation matrix
involving deforestation, population density, change in pasture and change
in cropland.
(b) Repeat part (a) using the following variables in the data set HPRICE.XLS:
house price, lot size, number of bedrooms, number of bathrooms and
number of storeys. How many individual correlations have you calculated?
Table 3.1 The correlation matrix for X, Y and Z.
                Column 1 (X)    Column 2 (Y)    Column 3 (Z)
Column 1 (X)    1
Column 2 (Y)    0.318237        1
Column 3 (Z)    -0.13097        0.096996        1
Chapter summary
1. Correlation is a common way of measuring the relationship between two
variables. It is a number that can be calculated using Excel or any spread-
sheet or econometric software package.
2. Correlation can be interpreted in a common sense way as a numerical
measure of a relationship or association between two variables.
3. Correlation can also be interpreted graphically by means of XY-plots. That
is, the sign of the correlation relates to the slope of a best fitting line through
an XY-plot. The magnitude of the correlation relates to how scattered the
data points are around the best fitting line.
4. Correlations can arise for many reasons. However, correlation does not nec-
essarily imply causality between two variables.
Appendix 3.1: Mathematical details
The correlation between X and Y is referred to by the small letter r and is calculated
as:
$$
r = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \, \sum_{i=1}^{N} (X_i - \bar{X})^2}},
$$

where X̄ and Ȳ are the means of X and Y (see Chapter 2). More intuitively, note that
if we were to divide the numerator and denominator of the previous expression by
N - 1, then the denominator would contain the product of the standard deviations
of X and Y, and the numerator, the covariance between X and Y. Covariance is a
concept that we have not defined here, but you may come across it in the future, par-
ticularly if you are interested in developing a deeper understanding of the statistical
theory underlying correlation.
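The equivalence described in this appendix can be verified numerically. The sketch below computes r once from the sums of deviations and once as covariance divided by the product of the standard deviations. The five data points are made up for illustration, and Python's `statistics.stdev` is used on the assumption that the standard deviation in Chapter 2 uses the N - 1 divisor, as the text above implies.

```python
import math
import statistics

# Hypothetical lot sizes (square feet) and sale prices; not from HPRICE.XLS.
x = [3000, 4500, 5200, 6100, 8000]
y = [42000, 60000, 55000, 83000, 90000]

n = len(x)
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Pieces of the appendix formula.
cross = sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
r_formula = cross / math.sqrt(ss_x * ss_y)

# Divide numerator and denominator by N - 1:
# the numerator becomes the covariance, the denominator the product
# of the two sample standard deviations.
covariance = cross / (n - 1)
sd_x = statistics.stdev(x)  # sample standard deviation (N - 1 divisor)
sd_y = statistics.stdev(y)
r_from_cov = covariance / (sd_x * sd_y)

print(f"r from the sums formula:     {r_formula:.6f}")
print(f"r from covariance / (sd*sd): {r_from_cov:.6f}")
```

The two routes agree up to floating-point rounding, because the N - 1 terms cancel between numerator and denominator.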
Endnotes
1. By “larger negative values” we mean more negative. For instance, -0.9 is a larger negative
value than -0.2.
2. An alternative explanation is that good neighborhoods tend to have houses with large lots.
People are willing to pay extra to live in a good neighborhood. Thus, it is possible that
houses with large lots tend also to have higher sales prices, not because people want large
lots, but because they want to live in good neighborhoods. In other words, “lot size” may
be acting as a proxy for the “good neighborhood” effect. We will discuss such issues in
more detail in later chapters on regression. You should merely note here that the inter-
pretation of correlations can be quite complicated and a given correlation pattern may be
consistent with several alternative stories.
3. It is a controversial issue among psychologists and educators as to whether intelligence
tests really are meaningful measures of intelligence. For the purposes of answering this
question, avoid this controversy and assume that they are indeed an accurate reflection of
intelligence.
4. We will formalize the meaning of “upward” or “downward” sloping patterns in the XY-
plots when we come to regression. To aid in interpretation, think of drawing a straight line
through the points in the XY-plot that best captures the pattern in the data (i.e. is the best
fitting line). The upward or downward slope discussed here refers to the slope of this line.