Part II - Random Variables and Distributions
Cumulative Distribution Functions
The Median and Other Quantiles
Expected Values of Discrete Variables
The Variance and Standard Deviation
A random variable is a function whose domain is the sample space Ω of a random experiment and whose codomain is the set of real numbers. Informally, a random variable is a numerical observation resulting from the outcome of a random experiment. For example, if the experiment consists of selecting a random sample of 5 members of a human population, A, the average age of the 5 members, is a random variable. M, the number of males in the sample, is another one.
The numerical values associated with experimental outcomes by a random variable may be mere tags or names for which arithmetic operations such as addition and multiplication are meaningless. Letters from the alphabet or spelled-out names would serve just as well. Such variables are called nominal variables, or sometimes factors. In contrast, numeric variables have values with true numerical significance and are often related to a scale of measurement.
Let X denote a random variable and let I denote an interval of real numbers. I can be any kind of interval - open, closed, degenerate (consisting of a single point), a half line, or the entire set of real numbers. The set of experimental outcomes ω for which the corresponding value of X lies in I is an event, denoted by [X ∈ I]. This notation is modified to suit particular kinds of intervals, e.g., [0 < X ≤ 1], [Y > 2.4], [Z = -1].
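The idea of a random variable as a function on the sample space, and of an event as a set of outcomes, can be illustrated on a small example. The sketch below assumes a hypothetical experiment of tossing a fair coin three times with equally likely outcomes; the names omega, X, event, and prob are illustrative and do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: all sequences of three tosses, equally likely.
omega = list(product("HT", repeat=3))

def X(outcome):
    """Random variable: the number of heads in the outcome."""
    return outcome.count("H")

def event(rv, lo, hi):
    """The event [lo < rv <= hi], as the list of outcomes it contains."""
    return [w for w in omega if lo < rv(w) <= hi]

def prob(outcomes):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(outcomes), len(omega))

A = event(X, 0, 1)   # the event [0 < X <= 1]: exactly one head
p_A = prob(A)        # 3 of the 8 outcomes, so 3/8
```

Here the event [0 < X ≤ 1] is literally a subset of the sample space, and its probability is obtained by counting the outcomes it contains.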
Two or more random variables X1, X2, ... are jointly distributed if they arise from the same random experiment, i.e., are defined for the same sample space. This means that for each outcome of the experiment, the variables in the list all have values simultaneously. For example, if the experiment is to randomly select one member of a human population, the age, height, sex, and marital status of the person selected are jointly distributed variables.
If X is a random variable and x denotes an arbitrary real number, the event [X ≤ x] has a probability between 0 and 1. If x is allowed to vary over the set of all real numbers, this probability is a function of x. It is called the cumulative distribution function (cdf) of the random variable X. In symbols,
FX(x) = P[X ≤ x] for all real numbers x.
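As a small illustration, the cdf of a random variable with finitely many values can be computed by adding up the probabilities of the values not exceeding x. The sketch below assumes X is the face shown by one roll of a fair six-sided die; the dictionary pmf and the function cdf are illustrative names.

```python
from fractions import Fraction

# Assumed example: X is the face shown by one roll of a fair die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    """P[X <= x], defined for every real number x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

half = cdf(3.5)   # P[X <= 3.5] = P[X in {1, 2, 3}] = 1/2
```

The function is defined for every real argument, not only for the possible values of X: it is 0 to the left of 1, equals 1 from 6 onward, and jumps by 1/6 at each possible value.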
If X and Y are jointly distributed random variables, their joint cumulative distribution function is a function of two arguments, x and y. It is defined as
FX,Y(x, y) = P[X ≤ x and Y ≤ y] for all real numbers x and y.
This notation can be extended to any number of jointly distributed random variables.
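A joint cdf can be evaluated by counting outcomes for which both inequalities hold at once. The following sketch assumes a hypothetical experiment of rolling two fair dice, with X the value of the first die and S the sum of the two; these variables and the function joint_cdf are illustrative choices, not taken from the text.

```python
from fractions import Fraction
from itertools import product

# Assumed experiment: roll two fair dice; all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def X(w):
    return w[0]          # value of the first die

def S(w):
    return w[0] + w[1]   # sum of the two dice

def joint_cdf(x, y):
    """P[X <= x and S <= y], computed by counting outcomes."""
    favorable = [w for w in omega if X(w) <= x and S(w) <= y]
    return Fraction(len(favorable), len(omega))

value = joint_cdf(2, 4)   # outcomes (1,1), (1,2), (1,3), (2,1), (2,2): 5/36
```

Both variables are evaluated on the same outcome w, which is what it means for them to be jointly distributed.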
Jointly distributed random variables X1, X2, ..., Xn are independent if for any sequence of intervals I1, I2, ..., In
P[X1 ∈ I1, X2 ∈ I2, ..., Xn ∈ In] = P[X1 ∈ I1] · P[X2 ∈ I2] · ... · P[Xn ∈ In].
Informally, this means that if the values assumed by some of these random variables are known, that knowledge does not help in predicting the values assumed by others. In the experiment of selecting a random sample of 5 members of a human population, the average age A of the members in the sample and the number of males M in the sample are independent random variables. The average age and the largest age of members of the sample are not independent. They are dependent.
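The product rule for independence can be checked by brute force on a small sample space. The sketch below assumes a hypothetical experiment of rolling two fair dice. For variables with finitely many values it is enough to check the rule on degenerate intervals, that is, on events of the form [U = u] and [V = v], which is what the illustrative helper factorizes does.

```python
from fractions import Fraction
from itertools import product

# Assumed experiment: roll two fair dice; all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def prob(pred):
    """Probability of the event {w : pred(w)}."""
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))

def X(w): return w[0]          # first die
def Y(w): return w[1]          # second die
def S(w): return w[0] + w[1]   # sum of both dice

def factorizes(U, V):
    """Check P[U = u, V = v] == P[U = u] * P[V = v] for all value pairs."""
    us = {U(w) for w in omega}
    vs = {V(w) for w in omega}
    return all(
        prob(lambda w: U(w) == u and V(w) == v)
        == prob(lambda w: U(w) == u) * prob(lambda w: V(w) == v)
        for u in us for v in vs
    )

xy_independent = factorizes(X, Y)   # True: the two dice are independent
xs_independent = factorizes(X, S)   # False: the sum depends on the first die
```

For instance, P[X = 1, S = 12] = 0 while P[X = 1] · P[S = 12] = 1/216, so knowing the first die does help in predicting the sum.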
A median of a random variable X with cdf FX is any number m such that P[X ≤ m] ≥ 0.5 and P[X ≥ m] ≥ 0.5. Informally, this means that at least half the values of X are greater than or equal to m and at least half the values of X are less than or equal to m. In case there is more than one number m satisfying this condition, we usually let the median be the smallest number that does so. However, some authors define the median to be the midpoint of the interval of numbers satisfying the condition. The median is also called the 50th percentile or the second quartile of the distribution of X.
If p is a number strictly between 0 and 1, the pth quantile, or 100pth percentile, of the distribution of X is the smallest number q such that P[X ≤ q] ≥ p and P[X ≥ q] ≥ 1 - p. The 25th percentile of a distribution is also called its first quartile and the 75th percentile is called the third quartile.
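The quantile definition can be applied directly when X has finitely many possible values, because the smallest number satisfying both inequalities is then always one of those values. The sketch below assumes X is the face shown by a fair six-sided die; pmf and quantile are illustrative names, and exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction

# Assumed example: X is the face shown by one roll of a fair die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def quantile(pmf, p):
    """Smallest q with P[X <= q] >= p and P[X >= q] >= 1 - p."""
    for q in sorted(pmf):
        at_most = sum(pr for v, pr in pmf.items() if v <= q)
        at_least = sum(pr for v, pr in pmf.items() if v >= q)
        if at_most >= p and at_least >= 1 - p:
            return q
    raise ValueError("p must be strictly between 0 and 1")

median = quantile(pmf, Fraction(1, 2))           # 3
first_quartile = quantile(pmf, Fraction(1, 4))   # 2
third_quartile = quantile(pmf, Fraction(3, 4))   # 5
```

In this example every number from 3 to 4 satisfies the median condition. The smallest-number convention gives 3, while the midpoint convention would give 3.5.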
A random variable X is discrete if its values can be arranged in a finite or infinite sequence x1, x2, ... In contrast, a random variable of continuous type assumes all values in an interval of real numbers. For example, the number of heads in 10 successive tosses of a coin is a discrete random variable with possible values 0, 1, ... , 10. The average height of a random sample of 10 adult males from the U.S. population is much more conveniently treated as a random variable of continuous type.
If X is a discrete random variable with possible values x1, x2, ... , the probability mass function of X is a function whose domain is the set of possible values. It is defined by
pX(xi) = P[X = xi], i = 1, 2, ...
It is convenient to extend the domain of pX to the set of all real numbers x by defining pX(x) = 0 if x is not one of the xi.
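As an illustration, consider the number of heads in 10 tosses of a coin. The sketch below additionally assumes that the coin is fair and the tosses are independent, so that each of the 2^10 toss sequences is equally likely and the probability of x heads is the number of sequences with x heads divided by 2^10. The function name p_X is illustrative.

```python
from fractions import Fraction
from math import comb

N_TOSSES = 10   # assumed: fair coin, independent tosses

def p_X(x):
    """pmf of the number of heads, extended by 0 to all real x."""
    if 0 <= x <= N_TOSSES and x == int(x):
        # comb(n, k) counts the toss sequences with exactly k heads.
        return Fraction(comb(N_TOSSES, int(x)), 2 ** N_TOSSES)
    return Fraction(0)   # x is not one of the possible values 0, 1, ..., 10

total = sum(p_X(k) for k in range(N_TOSSES + 1))   # probabilities sum to 1
```

Evaluating p_X at a number such as 2.5 or -1 returns 0, matching the extension of the domain to all real numbers.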
Let X be a discrete random variable with probability mass function pX and let g be a real-valued function of a real argument x. The expected value of g(X) is defined as
E[g(X)] = Σ g(xi) pX(xi),
where the sum is taken over all possible values xi of X.
If X has an infinite sequence of possible values, this is an infinite series and it is required that it be absolutely convergent. If we choose g(x) ≡ x, the expected value is called the mean of the random variable X. It is commonly denoted by the Greek letter μ (mu).
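For a random variable with finitely many values, the expected value of g(X) is a finite weighted sum and can be computed directly. The sketch below assumes X is the face shown by a fair six-sided die; expected_value and pmf are illustrative names.

```python
from fractions import Fraction

# Assumed example: X is the face shown by one roll of a fair die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expected_value(g, pmf):
    """Sum of g(x) * p(x) over all possible values x."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expected_value(lambda x: x, pmf)                 # 7/2
second_moment = expected_value(lambda x: x ** 2, pmf)   # 91/6
```

Choosing g(x) = x gives the mean. Other choices of g, such as g(x) = x², give other expected values of the same distribution.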
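Let X be a discrete random variable and let μ be the mean of X. The variance of X is defined as
var(X) = E[(X - μ)²] = Σ (xi - μ)² pX(xi),
where the sum is taken over all possible values xi of X.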
The standard deviation of X is the square root of the variance. It is usually denoted by the Greek letter σ (sigma). Thus, the variance is also denoted by σ².
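The variance is the expected squared deviation from the mean, so it can be computed in two passes: first the mean, then the weighted sum of squared deviations. The sketch below assumes X is the face shown by a fair six-sided die; the names pmf, mu, variance, and sigma are illustrative.

```python
from fractions import Fraction
from math import sqrt

# Assumed example: X is the face shown by one roll of a fair die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Mean: weighted sum of the possible values.
mu = sum(x * p for x, p in pmf.items())                     # 7/2

# Variance: weighted sum of squared deviations from the mean.
variance = sum((x - mu) ** 2 * p for x, p in pmf.items())   # 35/12

# Standard deviation: square root of the variance (a float, about 1.708).
sigma = sqrt(variance)
```

The variance is expressed in squared units of X, whereas the standard deviation is in the same units as X itself.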