Data types and sampling methods shape every later step of a statistical study. Qualitative data describes attributes or categories, while quantitative data records numeric counts or measurements. Knowing which type you have determines which summaries and analysis techniques are appropriate.
Random sampling reduces selection bias, so a sample can stand in for its population. Simple random, stratified, cluster, and systematic sampling are common methods. Each has its strengths, and the best choice depends on the study design and on how the population is organized.
Types of Data and Sampling Methods
Qualitative vs quantitative data
- Qualitative data consists of non-numeric attributes, characteristics, or categories that describe the data
- The categories are labels rather than numeric measurements, although the number of observations in each category can be counted
- Analyzed using frequencies, proportions, or percentages
- Examples: hair color (blonde, brown, black, red), movie genres (action, comedy, drama, horror), phone brands (Apple, Samsung, Google, OnePlus)
- Quantitative data is numeric and can be measured, counted, or expressed using numbers
- Discrete quantitative data has countable values, often integers
- Takes separate, distinct values with gaps between them (a household can own 2 or 3 pets, but not 2.5)
- Examples: number of pets owned (0, 1, 2, 3), number of languages spoken (1, 2, 3, 4), number of bedrooms in a house (1, 2, 3, 4, 5)
- Continuous quantitative data is measurable and can take on any value within a range
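- Has infinitely many possible values between any two points in the range; recorded values are limited only by the precision of the measuring instrument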
- Examples: body temperature (98.6°F, 99.2°F, 100.5°F), distance traveled (5.2 miles, 10.7 miles, 26.4 miles), time spent studying (1.5 hours, 2.75 hours, 4.33 hours)
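The distinction above determines which summaries make sense. The following is a minimal sketch using only the Python standard library: it summarizes a qualitative variable with proportions and quantitative variables with means. The data values and the helper name `proportions` are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses, invented for this example.
hair_color = ["brown", "black", "brown", "blonde", "red", "brown", "black", "brown"]
pets_owned = [0, 1, 2, 0, 3, 1, 1, 0]           # discrete: counts
body_temp_f = [98.6, 99.2, 100.5, 98.4, 98.9]   # continuous: measurements


def proportions(values):
    """Return the share of observations that fall in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Qualitative data: summarize with frequencies or proportions.
hair_props = proportions(hair_color)

# Quantitative data: numeric summaries such as the mean are meaningful.
mean_pets = mean(pets_owned)
mean_temp = mean(body_temp_f)

print(hair_props)
print(mean_pets)
print(round(mean_temp, 2))
```

Averaging the hair colors would be meaningless, which is why the qualitative variable is summarized by category shares instead.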
Interpretation of two-way tables
- Two-way tables, also known as contingency tables, display frequencies or counts of data based on two categorical variables
- Rows represent levels of one variable, while columns represent levels of the other variable
- Each cell shows the frequency or count for a specific combination of the two variables
- Marginal distributions provide information about a single variable, ignoring the other variable
- Found by summing the frequencies or counts across each row or column
- Row totals and column totals represent the marginal distributions
- Marginal probability is calculated as:
- $P(A) = \frac{\text{Total in row A}}{\text{Grand total}}$
- $P(B) = \frac{\text{Total in column B}}{\text{Grand total}}$
- Conditional distributions show the frequencies or proportions of one variable, given a specific level of the other variable
- Calculated by dividing a cell value by its row total (when conditioning on the row variable) or by its column total (when conditioning on the column variable)
- Conditional probability is calculated as:
- $P(A|B) = \frac{P(A \cap B)}{P(B)}$, read as "probability of A given B"
- $P(B|A) = \frac{P(A \cap B)}{P(A)}$, read as "probability of B given A"
- Interpretation: "Given that event B has occurred, what is the probability of event A occurring?" or vice versa
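The marginal and conditional formulas above can be checked on a small table. This is a sketch with invented counts: the variable names, the row labels (class year), and the column labels (study mode) are all assumptions for illustration.

```python
# Hypothetical two-way table: rows are class year, columns are study mode.
table = {
    "first-year": {"online": 30, "in-person": 20},
    "senior":     {"online": 10, "in-person": 40},
}

# Marginal totals: sum across each row and down each column.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Marginal probabilities: row or column total over the grand total.
p_first_year = row_totals["first-year"] / grand_total   # 50 / 100
p_online = col_totals["online"] / grand_total            # 40 / 100

# Joint probability: one cell over the grand total.
p_first_year_and_online = table["first-year"]["online"] / grand_total  # 30 / 100

# Conditional probabilities: P(A|B) = P(A and B) / P(B).
p_first_year_given_online = p_first_year_and_online / p_online
p_online_given_first_year = p_first_year_and_online / p_first_year

# Equivalent shortcut: divide the cell by the column total directly.
shortcut = table["first-year"]["online"] / col_totals["online"]  # 30 / 40

print(p_first_year, p_online)
print(p_first_year_given_online, p_online_given_first_year, shortcut)
```

The two conditional probabilities differ because they divide the same cell by different totals, which is why the direction of "given" matters when reading a two-way table.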
Methods of random sampling
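- Simple random sampling gives every member of the population an equal chance of being selected, and every possible sample of size n an equal chance of being drawn
- Randomly select a sample of size n from a population of size N
- Removes selection bias, so the sample tends to be representative, although any single sample can still differ from the population by chance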
- Example: randomly selecting 100 students from a school of 1,000 students using a random number generator
- Stratified sampling involves dividing the population into homogeneous subgroups (strata) based on a specific characteristic
- Simple random sampling is then performed within each stratum
- Ensures representation of all subgroups in the sample
- Example: dividing a city's population by income levels (low, medium, high) and randomly sampling from each income level
- Cluster sampling divides the population into naturally occurring groups (clusters)
- Randomly select a sample of clusters, and include all members within the selected clusters
- Useful when a complete list of population members is not available
- Example: randomly selecting 10 city blocks and surveying all households within those blocks
- Systematic sampling selects elements at a fixed interval from an ordered list of the population
- Compute the interval k = N/n (population size divided by sample size), pick a random starting point among the first k elements, then choose every kth element after it
- Spreads the sample evenly across the list
- Potential for sampling bias if there is a periodic pattern in the population list
- Example: selecting every 10th person from a list of 1,000 people, starting at a randomly chosen point
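The four methods above can be sketched in a few lines each. This is an illustrative sketch using Python's standard `random` module, not a production sampling library: the function names, the stratum boundaries, and the block sizes are all assumptions, and the "students" are just the ID numbers 1 to 1,000.

```python
import random


def simple_random_sample(population, n, rng):
    """Draw n members without replacement; every subset is equally likely."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Run a simple random sample inside each stratum (label -> members)."""
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of each one."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for label in chosen for member in clusters[label]]


def systematic_sample(population, n, rng):
    """Random start among the first k elements, then every kth element."""
    k = len(population) // n          # assumes the list is at least n long
    start = rng.randrange(k)
    return population[start::k][:n]


rng = random.Random(42)               # fixed seed so the run is repeatable
students = list(range(1, 1001))       # hypothetical population of 1,000 IDs

srs = simple_random_sample(students, 100, rng)

# Hypothetical strata; real strata would come from a recorded characteristic.
strata = {"low": students[:500], "medium": students[500:800], "high": students[800:]}
stratified = stratified_sample(strata, 20, rng)

# Hypothetical clusters: 50 "blocks" of 20 members each.
blocks = {f"block-{b}": students[b * 20:(b + 1) * 20] for b in range(50)}
clustered = cluster_sample(blocks, 10, rng)

systematic = systematic_sample(students, 100, rng)

print(len(srs), len(clustered), len(systematic))
print({label: len(members) for label, members in stratified.items()})
```

Here the stratified sample draws the same number from each stratum for simplicity; drawing in proportion to stratum size is a common alternative.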
Measures of Variability and Statistical Inference
- Standard deviation measures the typical distance between data points and the mean
- Provides insight into the spread or dispersion of the data
- Calculated as the square root of the variance
- Variance quantifies the average squared deviation from the mean
- Useful for comparing the spread of different datasets
- Larger variance indicates greater variability in the data
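- Central limit theorem states that, for independent observations from a population with finite variance, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases
- Applies regardless of the shape of the population distribution, although strongly skewed populations need larger samples before the approximation is good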
- Enables the use of normal distribution properties for statistical inference
- Sampling distribution represents the distribution of a statistic (e.g., sample mean) for all possible samples of a given size from a population
- Provides information about the variability of the statistic across different samples
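- Confidence interval is a range of values, computed from a sample, that is likely to contain the true population parameter; a 95% confidence level means that about 95% of intervals built this way from repeated samples would contain it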
- Based on the sample statistic and its standard error
- Wider intervals indicate less precision in the estimate
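The ideas in this section can be tied together with a small simulation. The sketch below uses only the standard library and rests on several assumptions: the population is a made-up, right-skewed set of exponential values, the sample sizes 5 and 50 are arbitrary, and the interval uses the normal critical value 1.96 as an approximation (a t critical value would be slightly larger at this sample size).

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population: exponential values with mean near 2.
population = [rng.expovariate(1 / 2.0) for _ in range(20_000)]
pop_variance = statistics.pvariance(population)  # average squared deviation
pop_sd = statistics.pstdev(population)           # square root of the variance


def sample_means(population, n, repeats, rng):
    """Approximate the sampling distribution of the mean for sample size n."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]


means_n5 = sample_means(population, 5, 2000, rng)
means_n50 = sample_means(population, 50, 2000, rng)

# The spread of the sample means shrinks roughly like pop_sd / sqrt(n).
se_n5 = statistics.stdev(means_n5)
se_n50 = statistics.stdev(means_n50)

# Approximate 95% confidence interval for the mean from one sample of 50.
sample = rng.sample(population, 50)
x_bar = statistics.fmean(sample)
std_err = statistics.stdev(sample) / len(sample) ** 0.5
ci = (x_bar - 1.96 * std_err, x_bar + 1.96 * std_err)

print(round(pop_sd, 3), round(se_n5, 3), round(se_n50, 3))
print(tuple(round(bound, 3) for bound in ci))
```

Plotting `means_n5` and `means_n50` as histograms would show the second one looking narrower and closer to a bell shape, even though the population itself is skewed.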