Data Collection

Generally, questionnaires facilitate data collection and can be used to structure the data in preparation for analysis.

Questionnaire Design

To ask the right questions, it is best to start with the analysis tables that will be used to test the stated hypotheses. These tables show which data elements must be collected or derived from other data.

Questionnaire development

In practice, questionnaire design and analysis planning are often iterated until the desired product is achieved. The key point is that data collection MUST be designed with the planned analysis in mind. In other words, the analysis is planned before the data are collected. An easy trap is to collect what one thinks is needed and then proceed with the analysis. Invariably, essential data elements will have been forgotten. The hypothesis being tested drives the analysis. It is often helpful to create empty tables or "table templates" showing the column and row headings, as well as charts with the x and y axes labelled, so that you can visualize what the desired results section of the report will look like. This can be immensely valuable in identifying the data elements that really need to be collected.

Tables for Cases:

The following tables are sample table templates that one might create in planning an analytic study of a gastroenteritis outbreak. The hypothesis being tested in this example was that eggs from a local farm were the source of the exposure.

Number of Cases

Frequency of Symptoms


Duration of Symptoms


Serious Consequences



Tables Comparing Cases and Controls:

Demographic Characteristics of Cases and Controls 


History of Exposures (by number)


2x2 Table for Exposure to Farm Eggs


The preceding tables are just examples; there might be many more tables, as well as figures. The key is to imagine the results section of the report and work backwards to identify the data elements and questions. It is important that every statement be written clearly so that it can be understood by others. For example, in the Exposure table above, what is meant by "bought eggs from farm"? Does this mean the eggs were just bought, or were they bought and eaten?
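As a rough illustration of working backwards from a table template, the sketch below builds the four cells of a 2x2 exposure table from a list of records. The function name `two_by_two` and the field names `ate_farm_eggs` and `is_case` are hypothetical, and the records are made up for illustration only. Calling the function with no records yields the empty template, which makes it plain that exactly two data elements must be collected for this table.

```python
from collections import Counter


def two_by_two(records, exposure_key, case_key):
    """Count records into the four cells of a 2x2 exposure-by-outcome table."""
    counts = Counter(
        (bool(r[exposure_key]), bool(r[case_key])) for r in records
    )
    # A Counter returns 0 for cells with no records, so empty cells still appear.
    return {
        "exposed_cases": counts[(True, True)],
        "exposed_controls": counts[(True, False)],
        "unexposed_cases": counts[(False, True)],
        "unexposed_controls": counts[(False, False)],
    }


# The empty template: every cell is zero before any data are collected.
template = two_by_two([], "ate_farm_eggs", "is_case")

# Made-up records, for illustration only (not data from any real outbreak).
records = [
    {"ate_farm_eggs": True, "is_case": True},
    {"ate_farm_eggs": True, "is_case": False},
    {"ate_farm_eggs": False, "is_case": False},
    {"ate_farm_eggs": True, "is_case": True},
]
table = two_by_two(records, "ate_farm_eggs", "is_case")
```

Naming the exposure field `ate_farm_eggs` rather than something like `bought_eggs` is one way to keep the question of bought versus eaten from becoming ambiguous in the data.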

The outline of the hypothesis-testing questionnaire can be constructed based on the earlier questions posed in the hypothesis-generating questionnaire:

  • Introduction: who you are, the value of the study, and who is sponsoring the survey;

  • Personal identifiers and follow-up information;

  • Demographic information (age, sex, etc.);

  • Outcomes (disease): including sufficient information to know whether a person meets the case definition;

  • Exposures and other risk factors.

Managing Data

Data can be collected on paper forms or entered directly into an electronic application. Electronic applications often provide data validity checks that allow only responses with specific values to be entered.
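A validity check of this kind can be sketched in a few lines. This is a minimal illustration and not the behavior of any particular data-entry package; the field names, the permitted codes, and the function `validate_entry` are all assumptions chosen for the example.

```python
# Hypothetical validity rules: each field maps to the set of codes it accepts.
ALLOWED_VALUES = {
    "sex": {"M", "F"},
    "ate_farm_eggs": {"Y", "N", "U"},  # yes, no, unknown
}


def validate_entry(field, value):
    """Return True if `value` is one of the codes permitted for `field`."""
    if field not in ALLOWED_VALUES:
        raise KeyError(f"No validity rule defined for field {field!r}")
    return value in ALLOWED_VALUES[field]


accepted = validate_entry("sex", "M")      # a permitted code
rejected = validate_entry("sex", "male")   # free text is not a permitted code
```

Rejecting a value such as "male" at entry time, instead of recoding it later, keeps the stored data consistent with the coding scheme.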

Categorization of data should occur prior to or at the time of data entry. For example, "male" and "female" should be coded for ease of data entry and to avoid entry errors. Is 1 used for "male" and 2 for "female", M and F, or something else? A detailed record of coding schemes is a fundamental requirement. Creating a simple table that lists variable names, their format, and coding is invaluable when it comes time to analyze the data or when reviewing data months later (see the figure below). Further columns can be added to record the type of variable, its length and format, and any comments. Many software packages create such a record automatically and keep a history of edits so that changes can be tracked over time.
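The coding record described above could be kept in a simple structure like the one sketched here. The variable names, codes, and the helper `decode` are hypothetical and serve only to show the idea of one entry per variable, with its type, coding, and a comment.

```python
# A hypothetical data dictionary: one entry per variable in the data set.
DATA_DICTIONARY = [
    {"name": "sex", "type": "categorical",
     "codes": {1: "male", 2: "female"}, "comment": ""},
    {"name": "age_years", "type": "integer",
     "codes": None, "comment": "age at interview"},
]


def decode(variable, code):
    """Translate a stored code back to its label using the data dictionary."""
    for entry in DATA_DICTIONARY:
        if entry["name"] == variable:
            if entry["codes"] is None:
                return code  # uncoded variables are returned unchanged
            return entry["codes"][code]
    raise KeyError(f"Variable {variable!r} is not in the data dictionary")


label = decode("sex", 2)
age = decode("age_years", 34)
```

Keeping the codes in one place means that anyone reviewing the data months later can look up what a stored 1 or 2 means without guessing.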

To minimize data entry errors, some data entry programs allow the creation of drop-down menus within the data-entry form. The analysis tables can then be built from categories that are already pre-coded. The choice of software, or whether a computer is used at all, will depend on personal and/or group experience in that area. Like any other tool, it is best to learn how to use programs beforehand. Outbreaks are good learning experiences, but they are probably the wrong time to learn how to use a computer for data management.

Use of such software is beyond the scope of this module. Self-instruction materials are available at each web site.


Data Dictionary Example