[![Build Status](https://secure.travis-ci.org/tmcw/simple-statistics.png?branch=master)](http://travis-ci.org/tmcw/simple-statistics) [![Coverage Status](https://coveralls.io/repos/tmcw/simple-statistics/badge.png)](https://coveralls.io/r/tmcw/simple-statistics)

A JavaScript implementation of descriptive, regression, and inference statistics.
Implemented in literate JavaScript with no dependencies, designed to work
in all modern browsers (including IE) as well as in node.js.
## [API Documentation](API.md)
---
Basic contracts of functions:
* Functions do not modify their arguments; for example, they never reorder an input array.
* Invalid input, such as an empty list passed to a function that needs at least one item, causes the function to return `null`.
# Basic Array Operations
### .mixin(array)
_Optionally_ mix in the following functions into the `Array` prototype. Otherwise
you can use them off of the simple-statistics object itself.
If given a particular array instance as an argument, this adds the functions
only to that array rather than the global `Array.prototype`. Without an argument,
it runs on the global `Array.prototype`.
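
The per-instance behavior described above can be pictured with a small sketch. This is not the library's code: `mixinInto` and the local `sum` are hypothetical names used only for illustration, and the sketch assumes `Object.defineProperty` is available.

```javascript
// Hypothetical stand-in for one statistics function.
function sum(x) {
    var total = 0;
    for (var i = 0; i < x.length; i++) total += x[i];
    return total;
}

// Attach each function in `methods` to a single array instance as a
// non-enumerable method, leaving Array.prototype untouched.
function mixinInto(array, methods) {
    Object.keys(methods).forEach(function(name) {
        Object.defineProperty(array, name, {
            value: function() {
                var args = Array.prototype.slice.call(arguments);
                // Pass the array itself as the first argument.
                return methods[name].apply(null, [this].concat(args));
            },
            enumerable: false,
            configurable: true,
            writable: true
        });
    });
    return array;
}

var list = mixinInto([1, 2, 3], { sum: sum });
var other = [4, 5];

list.sum();       // 6
typeof other.sum; // 'undefined', other arrays are unaffected
```

Defining the methods as non-enumerable keeps them out of `Object.keys` and `for...in` loops over the array.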
### .mean(x)
Mean of a single-dimensional Array of numbers. _Also available as `.average(x)`_
### .sum(x)
Sum of a single-dimensional Array of numbers.
### .mode(x)
Returns the number that appears most frequently in a single-dimensional Array
of numbers. If there are multiple modes, the one that appears last
is returned.
### .variance(x)
[Variance](http://en.wikipedia.org/wiki/Variance) of a single-dimensional Array of numbers.
### .standard_deviation(x)
[Standard Deviation](http://en.wikipedia.org/wiki/Standard_deviation) of a single-dimensional Array of numbers.
### .median_absolute_deviation(x)
The Median Absolute Deviation (MAD) is a robust measure of statistical
dispersion. It is more resilient to outliers than the standard deviation.
Accepts a single-dimensional array of numbers and returns a dispersion value.
Also aliased to `.mad(x)` for brevity.
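
As a rough illustration of why the MAD resists outliers, here is a self-contained sketch of the textbook definition: the median of the absolute deviations from the median. The function names are illustrative and the code is not taken from the library.

```javascript
// Median of an array of numbers; returns null for empty input.
function median(x) {
    if (x.length === 0) return null;
    // Sort a copy so the argument is not modified.
    var sorted = x.slice().sort(function(a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ?
        sorted[mid] :
        (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median of the absolute deviations from the median.
function medianAbsoluteDeviation(x) {
    if (x.length === 0) return null;
    var center = median(x);
    var deviations = x.map(function(v) { return Math.abs(v - center); });
    return median(deviations);
}

medianAbsoluteDeviation([1, 1, 2, 2, 4, 6, 9]);   // 1
medianAbsoluteDeviation([1, 1, 2, 2, 4, 6, 900]); // 1, the outlier has no effect
```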
### .median(x)
[Median](http://en.wikipedia.org/wiki/Median) of a single-dimensional array of numbers.
### .geometric_mean(x)
[Geometric mean](http://en.wikipedia.org/wiki/Geometric_mean) of a single-dimensional array of **positive** numbers.
### .harmonic_mean(x)
[Harmonic mean](http://en.wikipedia.org/wiki/Harmonic_mean) of a single-dimensional array of **positive** numbers.
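
The restriction to positive numbers follows from the definitions of these two means. The sketch below uses the standard formulas and returns `null` for empty or non-positive input, in line with the contract stated at the top of this document. Treating non-positive values as invalid is an assumption of this sketch, and the function names are illustrative.

```javascript
// Geometric mean: the n-th root of the product, computed through
// logarithms to avoid overflow on long inputs.
function geometricMean(x) {
    if (x.length === 0) return null;
    var logSum = 0;
    for (var i = 0; i < x.length; i++) {
        if (x[i] <= 0) return null; // the logarithm is undefined here
        logSum += Math.log(x[i]);
    }
    return Math.exp(logSum / x.length);
}

// Harmonic mean: n divided by the sum of reciprocals.
function harmonicMean(x) {
    if (x.length === 0) return null;
    var reciprocalSum = 0;
    for (var i = 0; i < x.length; i++) {
        if (x[i] <= 0) return null;
        reciprocalSum += 1 / x[i];
    }
    return x.length / reciprocalSum;
}

geometricMean([1, 2, 4]); // approximately 2
harmonicMean([1, 2, 4]);  // approximately 1.714 (12 / 7)
```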
### .root_mean_square(x)
[Root mean square (RMS)](http://en.wikipedia.org/wiki/Root_mean_square) of a single-dimensional array of numbers.
### .min(x)
Finds the minimum of a single-dimensional array of numbers. This runs in linear `O(n)` time.
### .max(x)
Finds the maximum of a single-dimensional array of numbers. This runs in linear `O(n)` time.
### .t_test(sample, x)
Performs a one-sample [Student's t-test](http://en.wikipedia.org/wiki/Student's_t-test) on a dataset `sample`, represented by a single-dimensional array of numbers. `x` is the known value to compare the sample against, and the result is a measure of [statistical significance](http://en.wikipedia.org/wiki/Statistical_significance).
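
For orientation, the textbook one-sample t statistic can be sketched as below. This version assumes the sample standard deviation (dividing by `n - 1`). The library may use a different denominator, so its values need not match this sketch exactly. The function names are illustrative.

```javascript
function mean(x) {
    var total = 0;
    for (var i = 0; i < x.length; i++) total += x[i];
    return total / x.length;
}

// Sample standard deviation (n - 1 denominator); an assumption of
// this sketch, not necessarily what the library uses.
function sampleStandardDeviation(x) {
    var m = mean(x), sumSquares = 0;
    for (var i = 0; i < x.length; i++) sumSquares += Math.pow(x[i] - m, 2);
    return Math.sqrt(sumSquares / (x.length - 1));
}

// t = (sample mean - known value) / (standard deviation / sqrt(n))
function oneSampleT(sample, x) {
    if (sample.length < 2) return null;
    var standardError = sampleStandardDeviation(sample) / Math.sqrt(sample.length);
    return (mean(sample) - x) / standardError;
}

oneSampleT([2, 4, 6], 2); // approximately 1.732
```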
### .t_test_two_sample(sample_x, sample_y, difference)
The two-sample t-test is used to compare samples from two populations or groups,
confirming or denying the suspicion (null hypothesis) that the populations are
the same. It returns a t-value that you can then look up to give certain
judgements of confidence based on a t distribution table.
This implementation expects the samples `sample_x` and `sample_y` to be given
as one-dimensional arrays of more than one number each.
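
A sketch of the pooled-variance form of the two-sample t statistic follows. It assumes that `difference` is the hypothesized difference between the population means and that it defaults to 0 when omitted. Both points are assumptions of the sketch rather than documented behavior, and the function names are illustrative.

```javascript
function mean(x) {
    var total = 0;
    for (var i = 0; i < x.length; i++) total += x[i];
    return total / x.length;
}

// Sample variance with the n - 1 denominator.
function sampleVariance(x) {
    var m = mean(x), sumSquares = 0;
    for (var i = 0; i < x.length; i++) sumSquares += Math.pow(x[i] - m, 2);
    return sumSquares / (x.length - 1);
}

function twoSampleT(sampleX, sampleY, difference) {
    var n = sampleX.length, m = sampleY.length;
    // Each sample needs more than one value.
    if (n < 2 || m < 2) return null;
    var diff = difference || 0; // assumed default
    // Variance pooled across both samples, weighted by degrees of freedom.
    var pooled = ((n - 1) * sampleVariance(sampleX) +
        (m - 1) * sampleVariance(sampleY)) / (n + m - 2);
    return (mean(sampleX) - mean(sampleY) - diff) /
        Math.sqrt(pooled * (1 / n + 1 / m));
}

twoSampleT([1, 2, 3], [4, 5, 6]); // approximately -3.674
```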
### .sample_variance(x)
Produces [sample variance](http://mathworld.wolfram.com/SampleVariance.html)
of a single-dimensional array of numbers.
### .sample_covariance(a, b)
Produces [sample covariance](http://en.wikipedia.org/wiki/Sample_mean_and_sample_covariance)
of two single-dimensional arrays of numbers.
### .sample_correlation(a, b)
Produces [sample correlation](http://en.wikipedia.org/wiki/Correlation_and_dependence)
of two single-dimensional arrays of numbers.
### .quantile(sample, p)
Computes the [quantile](http://en.wikipedia.org/wiki/Quantile) of a dataset `sample`
at `p`. For those familiar with the `k/q` syntax, `p == k/q`. `sample` must
be a single-dimensional array of numbers. `p` must be a number between zero and one inclusive, or an array of such numbers.
If an array is given, an array of results is returned instead of a single
number.
### .chunk(sample, chunkSize)
Given a `sample` array, and a positive integer `chunkSize`, splits an array
into chunks of `chunkSize` size and returns an array of those chunks. This
does not change the input value. If the length of `sample` is not divisible
by `chunkSize`, the last array will be shorter than the rest.
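
The chunking behavior described above can be sketched in a few lines. This is an illustration, not the library's source. Returning `null` for a non-positive `chunkSize` is an assumption based on the general contract for invalid input.

```javascript
function chunk(sample, chunkSize) {
    if (chunkSize <= 0) return null; // assumed handling of invalid sizes
    var chunks = [];
    // slice() copies, so the input array is left unchanged.
    for (var start = 0; start < sample.length; start += chunkSize) {
        chunks.push(sample.slice(start, start + chunkSize));
    }
    return chunks;
}

chunk([1, 2, 3, 4, 5], 2); // [[1, 2], [3, 4], [5]]
```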
### .quantile_sorted(sample, p)
Does a [quantile](http://en.wikipedia.org/wiki/Quantile) of a dataset `sample`,
at p. `sample` must be a one-dimensional _sorted_ array of numbers, and
`p` must be a single number greater than or equal to zero and less than or equal to one.
### .iqr(sample)
Calculates the [Interquartile range](http://en.wikipedia.org/wiki/Interquartile_range) of
a sample - the difference between the upper and lower quartiles. Useful
as a measure of dispersion.
_Also available as `.interquartile_range(x)`_
### .sample_skewness(sample)
Calculates the [skewness](http://en.wikipedia.org/wiki/Skewness) of
a sample, a measure of the extent to which a probability distribution of a
real-valued random variable "leans" to one side of the mean.
The skewness value can be positive or negative, or even undefined.
This implementation uses the [Fisher-Pearson standardized moment coefficient](http://en.wikipedia.org/wiki/Skewness#Pearson.27s_skewness_coefficients),
which means that it behaves the same as Excel, Minitab, SAS, and SPSS.
Skewness is only valid for samples of over three values.
### .jenks(data, number_of_classes)
Find the [Jenks Natural Breaks](http://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization) for
a single-dimensional array of numbers as input and a desired `number_of_classes`.
The result is a single-dimensional array of class breaks, including the minimum
and maximum of the input array.
### .r_squared(data, function)
Find the [r-squared](http://en.wikipedia.org/wiki/Coefficient_of_determination) value of a particular dataset, expressed as a two-dimensional `Array` of numbers, against a `Function`.

    var r_squared = ss.r_squared([[1, 1]], function(x) { return x * 2; });

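
To make the quantity concrete, here is a sketch of the usual definition: one minus the ratio of the residual sum of squares to the total sum of squares. Returning 1 for fewer than two points is an assumption of this sketch, and `rSquared` is an illustrative name.

```javascript
function rSquared(data, f) {
    // With fewer than two points there is no variation to explain;
    // this sketch treats that case as a perfect fit (an assumption).
    if (data.length < 2) return 1;

    var sumY = 0;
    for (var i = 0; i < data.length; i++) sumY += data[i][1];
    var meanY = sumY / data.length;

    var totalSquares = 0, residualSquares = 0;
    for (var j = 0; j < data.length; j++) {
        var x = data[j][0], y = data[j][1];
        totalSquares += Math.pow(y - meanY, 2);
        residualSquares += Math.pow(y - f(x), 2);
    }
    // Not guarded: if every y is identical, totalSquares is 0.
    return 1 - residualSquares / totalSquares;
}

var points = [[1, 2], [2, 4], [3, 6]];
rSquared(points, function(x) { return x * 2; }); // 1
rSquared(points, function(x) { return x + 2; }); // 0.75
```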
### .cumulative_std_normal_probability(z)
Look up the given `z` value in a [standard normal table](http://en.wikipedia.org/wiki/Standard_normal_table)
to calculate the probability of a random variable appearing with a given value.
### .z_score(x, mean, standard_deviation)
The standard score is the number of standard deviations an observation
or datum is above or below the mean.
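
The standard score is a one-line formula. The sketch below is illustrative only, with `zScore` as a hypothetical name.

```javascript
// z = (observation - mean) / standard deviation
function zScore(x, mean, standardDeviation) {
    return (x - mean) / standardDeviation;
}

zScore(78, 80, 5); // -0.4, two fifths of a standard deviation below the mean
```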
### .standard_normal_table
A [standard normal table](http://en.wikipedia.org/wiki/Standard_normal_table) from
which to pull values of Φ (phi).
## Regression
### .linear_regression()
Create a new linear regression solver.
#### .data([[1, 1], [2, 2]])
Set the data of a linear regression. The input is a two-dimensional array of numbers, which are treated as coordinates, like `[[x, y], [x1, y1]]`.
#### .line()
Get the linear regression line: this returns a function that you can
give `x` values and it will return `y` values. Internally, this uses the `m()`
and `b()` values and the classic `y = mx + b` equation.

    var linear_regression_line = ss.linear_regression()
        .data([[0, 1], [2, 2], [3, 3]]).line();
    linear_regression_line(5);

#### .m()
Just get the slope of the fitted regression line, the `m` component of the full
line equation. Returns a number.
#### .b()
Just get the y-intercept of the fitted regression line, the `b` component
of the line equation. Returns a number.
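
For readers who want to see where `m` and `b` come from, the following sketch computes an ordinary least-squares fit with the standard closed-form sums. It is a standalone illustration with a hypothetical `fitLine` helper, not the library's solver. Returning `null` for fewer than two points is an assumption of this sketch.

```javascript
// Least-squares fit of y = mx + b to [[x, y], ...] pairs.
function fitLine(data) {
    var n = data.length;
    if (n < 2) return null; // assumed handling; a line needs two points

    var sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
    for (var i = 0; i < n; i++) {
        var x = data[i][0], y = data[i][1];
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    // Not guarded: if every x is identical, the denominator is 0.
    var m = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    var b = (sumY - m * sumX) / n;

    return {
        m: m,
        b: b,
        line: function(x) { return m * x + b; }
    };
}

var fit = fitLine([[1, 2], [2, 3], [3, 4]]);
fit.m;       // 1
fit.b;       // 1
fit.line(5); // 6
```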
## Classification
### .bayesian()
Create a naïve bayesian classifier.
### .train(item, category)
Train the classifier to classify a certain item, given as an object with keys,
to be in a certain category, given as a string.
### .score(item)
Get the classifications of a certain item, given as an object of
`category -> score` mappings.

    var bayes = ss.bayesian();
    bayes.train({ species: 'Cat' }, 'animal');
    bayes.score({ species: 'Cat' });
    // { animal: 1 }

---
## [Literate Source](http://macwright.org/simple-statistics/)
## Usage
To use it in browsers, grab [simple_statistics.js](https://raw.github.com/tmcw/simple-statistics/master/src/simple_statistics.js).

To use it in node, install it with [npm](https://npmjs.org/) or add it to your package.json:

    npm install simple-statistics

To use it with [component](https://github.com/component/component):

    component install tmcw/simple-statistics

To use it with [bower](http://bower.io/):

    bower install simple-statistics

## Basic Descriptive Statistics
```javascript
// Require simple statistics
var ss = require('simple-statistics');

// The input is a simple array
var list = [1, 2, 3];

// Many different descriptive statistics are supported
var sum = ss.sum(list),
    mean = ss.mean(list),
    min = ss.min(list),
    geometric_mean = ss.geometric_mean(list),
    max = ss.max(list),
    quantile = ss.quantile(list, 0.25);
```
## Linear Regression
```javascript
// For a linear regression, it's a two-dimensional array
var data = [ [1, 2], [2, 3] ];

// simple-statistics can produce a linear regression and return
// a friendly javascript function for the line.
var line = ss.linear_regression()
    .data(data)
    .line();

// get a point along the line function
line(0);

// Get the r-squared value of the line estimation
ss.r_squared(data, line);
```
### Bayesian Classifier
```javascript
var bayes = ss.bayesian();
bayes.train({ species: 'Cat' }, 'animal');
bayes.score({ species: 'Cat' });
// { animal: 1 }
```
### Mixin Style
_This is **optional** and not used by default. You can opt-in to mixins
with `ss.mixin()`._
This mixes `simple-statistics` methods into the Array prototype - note that
[extending native objects](http://perfectionkills.com/extending-native-builtins/) is a
tricky move.
This will _only work_ if `defineProperty` is available, which means modern browsers
and nodejs - on IE8 and below, calling `ss.mixin()` will throw an exception.
```javascript
// mixin to Array class
ss.mixin();

// The input is a simple array
var list = [1, 2, 3];

// The same descriptive techniques as above, but in a simpler style
var sum = list.sum(),
    mean = list.mean(),
    min = list.min(),
    max = list.max(),
    quantile = list.quantile(0.25);
```
## Examples
* [Linear regression with simple-statistics and d3js](http://bl.ocks.org/3931800)
* [Jenks Natural Breaks with a choropleth map with d3js](http://bl.ocks.org/tmcw/4969184)
# Contributors
* Tom MacWright
* [Matt Sacks](https://github.com/mattsacks)
* Doron Linder
* [Alexander Sicular](https://github.com/siculars)