About the Super SURF
The New Zealand Income Survey (NZIS) Super SURF is a set of 100 synthetic unit-record files (SURFs) based on data from the NZIS: June 2003 quarter. Each of the SURFs (called sub-SURFs) contains around 11,000 records with a mixture of categorical and numerical variables. Together they act like 100 samples taken from a portion of the New Zealand population between 25 and 64 years of age who participate in paid work. This synthetic data should not be used as a source of accurate statistical information, but it is a largely realistic representation of a portion of the New Zealand population and can be used for teaching and learning purposes. It can also be used as a source of unit-record data for developing analytical methods or statistical processes.
The datasets are provided as two files that can be used in any data analysis software package. A data dictionary describing the variables in the SURFs is also available as a file at the top of this page.
- NZIS sub-SURF 1: This Excel file (482KB) contains a single sub-SURF (11,315 records).
- NZIS Super SURF: This comma-delimited text file (33.6MB) contains the entire Super SURF (made up of 100 sub-SURFs).
About the New Zealand Income Survey
The NZIS is run annually during the June quarter (April to June), as a supplement to the Household Labour Force Survey (HLFS). All respondents in the HLFS are asked to participate in the NZIS, which provides a snapshot of income levels for people and households. NZIS data gives average weekly income for the June quarter from most sources, including government transfers, investments, self-employment, and wages and salaries.
June 2003 quarter results were published in the NZIS: June 2003 quarter Hot Off the Press on 1 October 2003. More information, including questionnaires and technical notes, can be found on the NZIS resource web page.
Using the Super SURF
The NZIS Super SURF is a source of unit-record data that can be used for teaching and learning purposes, or for developing analytical methods or statistical processes. Any one sub-SURF can be used on its own for learning or teaching activities involving around 11,000 records. We provide one sub-SURF in a separate file for this purpose. Any number of sub-SURFs can also be combined to form a larger dataset. For example, combining five sub-SURFs forms a dataset of 56,575 records (5 × 11,315). To preserve the statistical properties of the synthetic data, whole sub-SURFs should be used rather than partial ones.
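Combining whole sub-SURFs amounts to keeping every record whose sub-SURF identifier is in a chosen set. The following is a minimal sketch in Python, not an official method. The column name `surf_id` and the file name used in the note afterwards are assumptions; check the data dictionary and the downloaded file for the real names.

```python
import csv
from typing import Dict, Iterable, Iterator, List, Set

# Hypothetical column name: check the data dictionary for the real header.
SURF_ID_COLUMN = "surf_id"


def read_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield each record of a comma-delimited file as a dict of strings."""
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle)


def combine_sub_surfs(
    rows: Iterable[Dict[str, str]], surf_ids: Set[str]
) -> List[Dict[str, str]]:
    """Keep every record of the requested sub-SURFs and drop all others.

    Filtering on the identifier alone means each selected sub-SURF is
    kept whole, never partially.
    """
    return [row for row in rows if row[SURF_ID_COLUMN] in surf_ids]
```

For example, `combine_sub_surfs(read_rows("super_surf.csv"), {"1", "2", "3", "4", "5"})` would return the records of five sub-SURFs, assuming the identifiers are stored as the integers 1 to 100 (the `csv` module reads them as strings).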
Multiple sub-SURFs can also be used to demonstrate statistical inference. Each sub-SURF is like one sample drawn from the population, so calculating the same measure (such as mean weekly income) for each sub-SURF gives the sampling distribution of that measure, from which confidence intervals can be calculated. In real sampling situations (such as Statistics NZ household surveys) only one sample is taken, and resampling methods are usually used to estimate the sampling error of a measure. Multiple sub-SURFs allow users to create sampling distributions for any statistic (for example, regression coefficients).
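One way to illustrate this idea is to compute mean weekly income separately for each sub-SURF and then summarise the resulting estimates. The sketch below uses only the Python standard library. The column names `surf_id` and `weekly_income` are assumptions, and the central 95% range is taken directly from the percentiles of the sub-SURF estimates, which is one simple choice among several.

```python
from collections import defaultdict
from statistics import mean, quantiles, stdev
from typing import Dict, Iterable, List, Tuple

# Hypothetical column names: check the data dictionary for the real headers.
SURF_ID_COLUMN = "surf_id"
INCOME_COLUMN = "weekly_income"


def mean_income_by_sub_surf(rows: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Return the mean weekly income of each sub-SURF, keyed by its ID."""
    incomes: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        incomes[row[SURF_ID_COLUMN]].append(float(row[INCOME_COLUMN]))
    return {surf_id: mean(values) for surf_id, values in incomes.items()}


def summarise_sampling_distribution(
    estimates: Iterable[float],
) -> Tuple[float, float, Tuple[float, float]]:
    """Summarise one estimate per sub-SURF (at least two are needed).

    Returns the mean of the estimates, their standard deviation (an
    empirical standard error), and the 2.5th and 97.5th percentiles.
    """
    values = list(estimates)
    # n=40 gives cut points every 2.5%; the first and last bound the
    # central 95% of the estimates.
    cuts = quantiles(values, n=40, method="inclusive")
    return mean(values), stdev(values), (cuts[0], cuts[-1])
```

The same pattern works for other statistics: replace `mean_income_by_sub_surf` with a function that returns, say, a regression coefficient for each sub-SURF, and pass the results to `summarise_sampling_distribution`.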
Properties of the Super SURF
All users should note the following properties of the Super SURF.
The Super SURF is made up of 100 sub-SURFs, and each sub-SURF contains 11,315 unit records based on a portion of the NZIS: June 2003 quarter, not the whole survey. The portion includes those who were between 25 and 64 years of age, reported working at least one hour per week, and had a positive average weekly income at the time of the survey. The data is synthetic, but key statistical measures of the Super SURF and the sub-SURFs, such as the mean, are similar to those in the real data. Relationships between the variables and the distributions of the two numerical variables (weekly income and weekly hours) also imitate the real data. These properties also apply to any combination of the sub-SURFs.
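After downloading the full file, the structure described above (100 sub-SURFs of 11,315 records each) can be checked with a short script. This is a sketch only; the column name `surf_id` is an assumption, and the rows are expected as dicts such as those produced by `csv.DictReader`.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical column name: check the data dictionary for the real header.
SURF_ID_COLUMN = "surf_id"


def check_structure(
    rows: Iterable[Dict[str, str]],
    expected_sub_surfs: int = 100,
    expected_records: int = 11_315,
) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    counts = Counter(row[SURF_ID_COLUMN] for row in rows)
    problems: List[str] = []
    if len(counts) != expected_sub_surfs:
        problems.append(
            f"found {len(counts)} sub-SURFs, expected {expected_sub_surfs}"
        )
    for surf_id, n_records in sorted(counts.items()):
        if n_records != expected_records:
            problems.append(
                f"sub-SURF {surf_id} has {n_records} records, "
                f"expected {expected_records}"
            )
    return problems
```

A non-empty result usually points to a truncated download or to rows lost during import rather than to a problem with the data itself.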
The following variables are included in the Super SURF, and more detail is given in the data dictionary file (available at the top of this page). The categorical variables in the Super SURF file are stored as numeric codes rather than character text, so the data dictionary is needed to fully interpret the data. Note that the income variable is weekly income from all sources, not income from wages and salary only.
- Person ID: a random unique ID number for each record in the Super SURF
- Surf ID: a numerical identifier for each sub-SURF
- Age: 25 to 64 years in 5-year age bands
- Sex: male and female
- Ethnicity: six categories
- Highest qualification: five categories
- Weekly hours: total usual weekly hours worked from all wages and salary jobs
- Weekly income: total usual weekly income from all sources.