Data Collection and Analysis

Section lead: Sonja Likumahuwa-Ackman, MID, MPH

To conduct IS research, researchers must use existing data or prospectively collect data. This section examines the equity considerations a researcher should weigh when evaluating common data sources and preparing to collect data. Racial Equity Tools has a useful two-page guide on possible equity concerns about using available (existing) data. The guide suggests reviewing five areas of potential issues:

  • Coverage: How much of the target group is included in the available data
  • Currency or timeliness: How recently the data were collected
  • Disaggregation: For what subgroups the data can be presented
  • Detail: How specific the information is in the areas of interest
  • Bias: What factors might potentially lead to misleading or inaccurate information
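
Three of these five areas (coverage, currency, and disaggregation) can be partly checked with simple counts before any analysis begins. The sketch below is one possible way to do that, not a prescribed method. The record layout, the field names `race_ethnicity` and `collected`, the list of "unknown" codes, and the minimum subgroup size are all illustrative assumptions. Detail and bias still require human judgment about how the data were generated.

```python
from collections import Counter
from datetime import date

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"id": 1, "race_ethnicity": "Black", "collected": date(2023, 5, 1)},
    {"id": 2, "race_ethnicity": "Unknown", "collected": date(2023, 6, 12)},
    {"id": 3, "race_ethnicity": "Hispanic", "collected": date(2022, 11, 3)},
    {"id": 4, "race_ethnicity": None, "collected": date(2023, 1, 20)},
    {"id": 5, "race_ethnicity": "White", "collected": date(2023, 2, 14)},
]

# Values treated as "unknown"; real data sets use their own codes.
UNKNOWN = {None, "", "Unknown", "Declined"}


def audit(records, target_population, as_of, min_cell_size=2):
    """Summarize coverage, currency, and disaggregation for a data set."""
    n = len(records)
    known = [r["race_ethnicity"] for r in records
             if r["race_ethnicity"] not in UNKNOWN]
    counts = Counter(known)
    newest = max(r["collected"] for r in records)
    return {
        # Coverage: share of the target group present in the data.
        "coverage": n / target_population,
        # Currency: age of the most recent record, in days.
        "days_since_newest_record": (as_of - newest).days,
        # Disaggregation: how much race/ethnicity data is unusable,
        # and which subgroups are too small to report separately.
        "share_unknown_race_ethnicity": (n - len(known)) / n,
        "subgroups_too_small": sorted(
            g for g, c in counts.items() if c < min_cell_size
        ),
    }


report = audit(records, target_population=20, as_of=date(2023, 12, 31))
for name, value in report.items():
    print(name, value)
```

A high share of unknown race/ethnicity values or a long list of too-small subgroups suggests that the data may not support the disaggregated analyses an equity-focused study needs.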

Table 5. Equity Considerations for Data

Types of Data Equity Considerations Specific Equity Considerations for Types of Data
Electronic Health Record (EHR; also called electronic medical record, EMR) Coverage EHR data are secondary data and include a person’s record of care, typically either inpatient or outpatient. Depending on how many visits a person has, the EHR record may be more or less complete. EHR data include a high level of detail about health care delivery. They also include demographics, though these are entered by clinical staff who may guess at a patient’s gender, race, or ethnicity, for example. There is typically a high percentage of unknown race/ethnicity data.
Currency Usually very timely, though it may take time to request and receive data.
Disaggregation Data can be disaggregated to many subpopulations of interest, including by date of visit, visit type, diagnosis, treatment, medication prescribed, and screening conducted.
Detail Information on follow-up to care is often not available: for example, the EHR records medications prescribed, but we do not know whether the patient filled the prescription. A procedure may be ordered or a referral made, but we do not know whether the patient attended the appointment if it is outside the specific health care system (e.g., referral for mammogram at a standalone radiology center).
Bias EHR data reflect the many well-documented biases in the health care system, such as under- or over-diagnosis, and disparities in access based on income, language, transportation, and insurance status. There is also bias in how clinic staff enter data, for example, assuming gender, race, or ethnicity without asking the patient how they identify.1
Insurance Claims Data
Examples: Medicare or Medicaid claims data; private insurance claims
Coverage Claims data are secondary data and include a person’s whole system of care (primary care, hospital, labs, potentially mental health, and dental). Claims data leave out people who are uninsured and are incomplete for people who are discontinuously insured, because care received while a person is uninsured does not generate insurance claims. Demographic data are included and usually thorough and accurate.
Currency Usually very timely, though it may take time to request and receive data.
Disaggregation Data can be disaggregated to many subpopulations of interest. Insurance claims data are not generated for research purposes, so they can be difficult to analyze.
Detail Claims data include the insurance-related information, such as billing codes, for the procedures and visits performed. They also reflect other claims, such as when a prescription is filled.
Bias Similar to EHR data, claims data reflect the biases of the health care system. The uninsured and discontinuously insured who are missing from claims data are disproportionately people of color, people with low incomes, people with limited English proficiency, and people who are less stably housed.2
Patient/Disease Registries Coverage Registries are databases of secondary data that are limited to a specific disease, but within that disease they may be quite complete. Data registries such as state immunization registries are considered the gold standard for data on certain topics.3-4
Currency Varies, but for well-established registries such as state cancer registries, typically hospitals and clinics send data to registries in near real time. Others rely on patient-reported data and may be less timely.
Disaggregation High, depending on the demographic and disease data collected.
Detail Population registries may have very detailed data on an individual, standardized across the population with that disease. Details include date of diagnosis, severity of disease (e.g., cancer stage), biological samples (e.g., tumor, genetic sample), medical history, treatments, procedures, and medications.
Bias Due to biases in the health care system, some patients have unequal access to diagnoses and procedures that would qualify them for a registry, which can lead to bias within the registry. Registries can also shed light on rare diseases that impact a small number of people.
Health Surveys Coverage Health surveys are primary data collected for research purposes, so the data are relatively easy to analyze. Coverage often excludes non-English speakers, people without telephones or internet, and people with low literacy. Non-federal health surveys typically have low response rates (below 10%), meaning that they may not be representative of the general population.
Currency Large federal survey data typically are collected at least 2–3 years before becoming available to researchers. Smaller surveys may be available more quickly.
Disaggregation Survey data can be disaggregated to a few subpopulations of interest depending on the questions. Whether a survey is cross-sectional or longitudinal will impact what types of analyses can be done and what types of conclusions can be drawn.*
Detail Depends on the questions asked and the response scale provided. Data are standardized across respondents, which makes comparison easier, but may miss details of differences between respondents.
Bias There is bias in how health survey questions are written. Interviewers may differ slightly in how they ask questions, leading to different responses.
Trusted surveys The most trusted surveys are federal: National Health Interview Survey (Centers for Disease Control and Prevention), Medical Expenditure Panel Survey (Agency for Healthcare Research and Quality), Behavioral Risk Factor Surveillance System, National Health and Nutrition Examination Survey, National Immunization Survey, National Survey on Drug Use and Health, Medicare Current Beneficiary Survey, and Current Population Survey Annual Social and Economic Supplement. These have response rates of 50%–75%.
Focus Groups Coverage Focus groups are a type of primary data collection.
Currency Data are collected in real time. Focus groups generally are recorded and transcribed, or extensive notes are taken, then analyzed. This can take time depending on the complexity.
Disaggregation Limited to the sample of participants.
Detail Limited to the questions asked and the skill of the facilitator in getting participation from everyone in the group.
Bias The design of a focus group can introduce bias, from the power dynamics among the participants (will everyone feel comfortable speaking up?) to the location of the group (is it a neutral location?). The facilitator brings their individual bias to the question design and facilitation methods. There is also a tendency within groups to agree with the most outspoken person, which may produce an inaccurate picture of what participants really think.
Interviews Coverage Interviews are primary data collection and are limited to the specific people who are interviewed.
Currency Data are collected in real time. Interviews generally are recorded and transcribed, then analyzed. This can take time depending on the complexity.
Disaggregation Data can be disaggregated based on the sample that was interviewed.
Detail A structured interview will give more comparability between respondents, while a semi-structured interview gives room for follow-up questions, which can yield important details.
Bias Like surveys, interviews are limited by access based on language, interview mode (telephone, internet, in-person), and literacy level. Sampling can also have bias.
Direct observation/field notes Coverage Limited in scope, but potentially very rich in detail. Usually limited to a single setting and possibly a single location within that setting.
Currency Data are collected in real time during the observation; analysis of field notes may take time depending on complexity.
Disaggregation Using qualitative analysis software, disaggregation by code is possible, and themes may be pulled out from across multiple data collection sites. From a single site, there is limited disaggregation.
Detail Observation yields a very high level of detail.
Bias The observer/researcher brings their biases to the observation.
* For example, see:
Brief descriptions of many of these types of data:
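
The claims-data rows of Table 5 note that claims are incomplete for people who are discontinuously insured. When an enrollment file is available alongside the claims, one way to gauge this is to measure how much of the study window each person was actually covered. The following is a minimal sketch under stated assumptions: the `enrollment` structure, the person identifiers, and the rule "covered for less than the full window" are hypothetical choices for illustration, and real enrollment files differ by payer. People who were never insured will not appear in such a file at all, so this check cannot detect them.

```python
from datetime import date, timedelta


def covered_days(spans, window_start, window_end):
    """Count days in the window covered by at least one enrollment span.

    Spans are (start, end) date pairs, inclusive on both ends.
    """
    # Clip each span to the study window and drop spans outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in spans
        if start <= window_end and end >= window_start
    )
    total = 0
    last_covered = None  # latest day already counted
    for start, end in clipped:
        # Skip any days already counted by an overlapping earlier span.
        if last_covered is not None and start <= last_covered:
            start = last_covered + timedelta(days=1)
        if start <= end:
            total += (end - start).days + 1
            last_covered = end
    return total


# Hypothetical enrollment file: person id -> list of coverage spans.
enrollment = {
    "A": [(date(2023, 1, 1), date(2023, 12, 31))],
    "B": [(date(2023, 1, 1), date(2023, 3, 31)),
          (date(2023, 9, 1), date(2023, 12, 31))],
}

window = (date(2023, 1, 1), date(2023, 12, 31))
window_length = (window[1] - window[0]).days + 1

# Share of the study window during which each person had coverage.
coverage_share = {
    pid: covered_days(spans, *window) / window_length
    for pid, spans in enrollment.items()
}

# People whose claims history is likely incomplete for this window.
discontinuous = sorted(
    pid for pid, share in coverage_share.items() if share < 1.0
)
print(coverage_share, discontinuous)
```

Comparing the demographics of the flagged group with those of the continuously covered group can show whether excluding people with coverage gaps would remove some populations disproportionately.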


  1. Young JC, Conover MM, Funk MJ. Measurement error and misclassification in electronic medical records: Methods to mitigate bias. Curr Epidemiol Rep. 2018;5(4):343-356. doi:10.1007/s40471-018-0164-x
  2. DeVoe JE, Gold R, McIntire P, Puro J, Chauvie S, Gallia CA. Electronic health records vs. Medicaid claims: Completeness of diabetes preventive care data in community health centers. Ann Fam Med. 2011;9(4):351-358. doi:10.1370/afm.1279
  3. Pop B, Fetica B, Blaga ML, et al. The role of medical registries, potential applications and limitations. Med Pharm Rep. 2019;92(1):7-14. doi:10.15386/cjmed-1015
  4. Campbell CI, Bahorik AL, VanVeldhuisen P, Weisner C, Rubinstein AL, Ray GT. Use of a prescription opioid registry to examine opioid misuse and overdose in an integrated health system. Prev Med. 2018;110:31-37. doi:10.1016/j.ypmed.2018.01.019