Descriptive statistics: describe the data.
Summarize (statistics), tabulate (frequency distribution), graph (histogram, etc.)
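
A minimal Python sketch of the "tabulate" step: a frequency distribution with relative frequencies and a crude text bar chart. The blood-type values and the helper name frequency_table are made up for illustration.

```python
from collections import Counter

# Hypothetical qualitative data (made up for illustration)
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "A", "O", "O"]

def frequency_table(values):
    """Return {category: (count, relative frequency)}, most frequent first."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, k / n) for cat, k in counts.most_common()}

table = frequency_table(blood_types)
for cat, (count, rel) in table.items():
    # One '*' per observation gives a crude text bar chart
    print(f"{cat:>2}  {count:2d}  {rel:5.0%}  {'*' * count}")
```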

Data

Data
  Qualitative (nominal, categorical): words
  Quantitative: numbers
    Discrete: integers; counts
    Continuous: real nos. (decimals); measurements

Levels of measurement
Each level below lists: what it is, examples, and what you can do with it.

Nominal
  What it is: names, labels, categories.
  Examples: Yes/No, Agree/Disagree, Have/Have-not, Success/Failure, M/F, ...
    Marital status, state, county, ZIP code, major, brand, make, model, color, place,
    race, religion, party, ideology, ..., tax filing status, blood type, housing type, pet.
  What you can do: Count/tally each category. Relative frequency. Mode. Bar chart.
    Chi-square tests (independence, goodness-of-fit).
    Confidence interval for a proportion (1-PropZInt).

Ordinal
  What it is: orderable/rankable categories, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.
  Examples: class (frosh/soph/jun/sen), trim levels, film ratings, gold/silver/bronze, letter grades, days of week, months, education level, clothing sizes, pain scales, military rank, star ratings, priority/risk levels.
    Likert scale:
      Strongly disagree / Disagree / Neutral / Agree / Strongly agree
      Very dissatisfied / Dissatisfied / Neutral / Satisfied / Very satisfied
      Poor / Fair / Good / Very good / Excellent
  What you can do: Above + median/quartiles, percentiles, Spearman.

Interval
  What it is: numbers: orderable, and differences between data values can be found and are meaningful. But no natural zero (0 does not mean "none of the quantity"); the 0 is fakish.
  Examples: temperature in C or F, years/dates, shoe size, IQ/SAT, FICO, pH, BMI.
  What you can do: Above + histogram, mean, median, SD, ...
    Estimation and confidence intervals, t-test, hypothesis testing, ANOVA, correlation, linear regression.

Ratio
  What it is: numbers: orderable, differences between data values can be found and are meaningful, there is a natural zero (meaning none of the quantity), and ratios (e.g. "twice as much") are meaningful.
  Examples: weight, height, age, length, area, volume, time, money, temperature in K, energy, BP, LDL, DJI, S&P 500, counts.
  What you can do: Above + CV (coefficient of variation), GM (geometric mean).
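
A rough Python sketch of which summary statistic fits which level, using only the standard library. All four data lists are hypothetical, and the numeric Likert codes are an assumed coding (1 = Strongly disagree ... 5 = Strongly agree).

```python
import statistics as st

# Hypothetical data at each level (made up for illustration)
nominal = ["A", "O", "O", "B", "O", "AB", "A"]   # blood types
ordinal = [1, 2, 2, 3, 4, 4, 5]                  # assumed Likert codes 1..5
interval = [36.4, 36.6, 36.8, 37.0, 37.2]        # temperatures in deg C
ratio = [2.0, 4.0, 8.0]                          # weights in kg

# Nominal: only counting makes sense, so report the most frequent category
mode_nominal = st.mode(nominal)

# Ordinal: order is meaningful; median_low always returns an actual data value
median_ordinal = st.median_low(ordinal)

# Interval: differences are meaningful, so mean and sample SD apply
mean_interval = st.mean(interval)
sd_interval = st.stdev(interval)

# Ratio: a natural zero makes ratio-based statistics meaningful
cv_ratio = st.stdev(ratio) / st.mean(ratio)      # coefficient of variation
gm_ratio = st.geometric_mean(ratio)              # geometric mean

print(mode_nominal, median_ordinal)
print(round(mean_interval, 2), round(sd_interval, 3))
print(round(cv_ratio, 3), round(gm_ratio, 3))
```

Each level inherits the statistics of the levels before it, so the mode or the median could also be computed for the interval and ratio lists.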

Measurements have some measuring unit, e.g. inches, pounds, meters, minutes, acres, grams, MPH, BPM, ng/L, ... but the units are basically irrelevant for the statistical work.


Data "set" (but, unlike a mathematical set, it can have duplicates): consists of observations/data values/measurements/datums/individuals/scores, which all mean the same thing. E.g. weights of adults, greasiness of bags of chips, longevity of bulbs, widget regional sales, effect of pill...

Some Triola data
Some data distros


Example: Population: weights of adults in country/county.
A census of this population is not possible, so we need an unbiased, representative sample (a teaspoon from the pot of soup).
Ideal: Simple random sample (SRS): every adult is equally likely to be in the sample, and every sample of that size is equally likely.
  Bad: voluntary response, convenience sample.
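
A minimal sketch of drawing a simple random sample in Python, assuming the population can be listed as a sampling frame. The ID list, the sample size of 50, and the helper name simple_random_sample are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members without replacement.

    random.sample makes every subset of size n equally likely,
    which is the defining property of an SRS.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical sampling frame: ID numbers for 10,000 adults
adults = list(range(1, 10_001))
sample = simple_random_sample(adults, 50, seed=1)
print(sorted(sample)[:5])
```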
Collect data. Measured vs self-reported (unreliable).
Calculate/derive a statistic from the data: a point estimate of the parameter. But samples have uncertainty/variability, so also determine a [confidence] interval estimate.
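
A hedged sketch of a point estimate with its interval estimate. It uses a proportion rather than the mean weight, because the large-sample z-interval (the 1-PropZInt calculation) needs only the standard library; an interval for a mean would typically use a t critical value instead. The survey counts and the function name one_prop_z_interval are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(x, n, confidence=0.95):
    """Large-sample (Wald) z-interval for a population proportion."""
    p_hat = x / n                                        # point estimate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value
    margin = z * sqrt(p_hat * (1 - p_hat) / n)           # margin of error
    return p_hat, (p_hat - margin, p_hat + margin)

# Hypothetical survey: 520 of 1000 sampled adults answer "Yes"
p_hat, (low, high) = one_prop_z_interval(520, 1000)
print(f"point estimate {p_hat:.3f}, 95% CI ({low:.3f}, {high:.3f})")
```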
Inferential statistics: use probability to understand/quantify/describe uncertainty.
If you have a census, i.e. the whole population is known, there is no need to sample: just describe the population. Samples are only taken/needed to estimate population parameter(s).


Interval: a continuous set of numbers between two endpoints, e.g. [1.45, 3.7]
Random (synonyms): stochastic, aleatory, chance, luck, mis/fortune, contingent, accidental, fate, fortuitous, haphazard
Data in Excel file (.xlsx): Open it in Excel. Select column, copy, then paste into other software (SW).

Data in Text file (.txt, .dat) or webpage, in column (of many columns, each a different data set):
  Open it or Import it in Excel. Select column, copy, then paste into other SW.
    OR
  Open it in Notepad and then select all (Ctrl-A) then copy (Ctrl-C) and paste into Excel. Select column, copy, then paste into other SW.
BodyTemperatures.txt
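
As an alternative to copy and paste, one column of a multi-column text file can be pulled out in Python. This is a sketch under assumptions: a tab-delimited layout with a header row is assumed (the actual layout of BodyTemperatures.txt is not given here), and the column names, numbers, and the helper name read_column are hypothetical.

```python
import csv
import io

# Stand-in for the contents of a multi-column text file; the column
# names and numbers are hypothetical.
text = """day1\tday2
98.6\t98.0
98.2\t97.4
97.9\t98.8
"""

def read_column(handle, name, delimiter="\t"):
    """Return one named column as a list of floats, skipping blank cells."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [float(row[name]) for row in reader if (row.get(name) or "").strip()]

day2 = read_column(io.StringIO(text), "day2")
print(day2)
```

With a real file, the same function can be given open(path, newline="") instead of the io.StringIO stand-in.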


Dot plot.

Stem-and-leaf plot.
Data: 44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106

Real-world example: train schedule (hours as stems, minutes as leaves).
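
The stem-and-leaf plot for the data listed above can be sketched in Python as follows. It assumes non-negative integers, with the tens (and above) as the stem and the units digit as the leaf; the function name stem_and_leaf is made up for illustration.

```python
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

def stem_and_leaf(values):
    """Return display lines: stem = tens and above, leaf = units digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

plot = stem_and_leaf(data)
print("\n".join(plot))
```

Keeping the empty stems (5 and 9) shows the gaps in the data and makes 106 stand out as a possible outlier.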