blake3
------

[![GoDoc](https://godoc.org/lukechampine.com/blake3?status.svg)](https://godoc.org/lukechampine.com/blake3)
[![Go Report Card](http://goreportcard.com/badge/lukechampine.com/blake3)](https://goreportcard.com/report/lukechampine.com/blake3)

```
go get lukechampine.com/blake3
```

`blake3` implements the [BLAKE3 cryptographic hash function](https://github.com/BLAKE3-team/BLAKE3).
This implementation aims to be performant without sacrificing (too much)
readability, in the hopes of eventually landing in `x/crypto`.

In addition to the pure-Go implementation, this package also contains AVX-512
and AVX2 routines (generated by [`avo`](https://github.com/mmcloughlin/avo))
that greatly increase performance for large inputs and outputs.

Contributions are greatly appreciated.
[All contributors are eligible to receive an Urbit planet.](https://twitter.com/lukechampine/status/1274797924522885134)


## Benchmarks

Tested on a 2020 MacBook Air (i5-7600K @ 3.80GHz). Benchmarks will improve as
soon as I get access to a beefier AVX-512 machine. :wink:

### AVX-512

```
BenchmarkSum256/64           120 ns/op       533.00 MB/s
BenchmarkSum256/1024        2229 ns/op       459.36 MB/s
BenchmarkSum256/65536      16245 ns/op      4034.11 MB/s
BenchmarkWrite               245 ns/op      4177.38 MB/s
BenchmarkXOF                 246 ns/op      4159.30 MB/s
```

### AVX2

```
BenchmarkSum256/64           120 ns/op       533.00 MB/s
BenchmarkSum256/1024        2229 ns/op       459.36 MB/s
BenchmarkSum256/65536      31137 ns/op      2104.76 MB/s
BenchmarkWrite               487 ns/op      2103.12 MB/s
BenchmarkXOF                 329 ns/op      3111.27 MB/s
```

### Pure Go

```
BenchmarkSum256/64           120 ns/op       533.00 MB/s
BenchmarkSum256/1024        2229 ns/op       459.36 MB/s
BenchmarkSum256/65536     133505 ns/op       490.89 MB/s
BenchmarkWrite              2022 ns/op       506.36 MB/s
BenchmarkXOF                1914 ns/op       534.98 MB/s
```
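
The MB/s column follows from the ns/op column once the input size is known. Go's `testing` package reports throughput with 1 MB = 1e6 bytes. The sketch below reproduces that arithmetic for the 65536-byte `Sum256` rows. The helper name `throughputMBps` is hypothetical and not part of this package. The results can differ slightly from the tables because the ns/op values shown are rounded.

```go
package main

import "fmt"

// throughputMBps converts a benchmark result into MB/s (1 MB = 1e6 bytes),
// given the bytes processed per operation and the time per operation in ns.
// Hypothetical helper for illustration; it is not part of the blake3 package.
func throughputMBps(bytesPerOp int64, nsPerOp float64) float64 {
	// bytes per nanosecond equals GB/s; multiply by 1e3 to get MB/s.
	return float64(bytesPerOp) / nsPerOp * 1e3
}

func main() {
	// 65536-byte input with the AVX-512 timing listed above.
	fmt.Printf("AVX-512: %.1f MB/s\n", throughputMBps(65536, 16245))
	// Same input size with the pure-Go timing listed above.
	fmt.Printf("Pure Go: %.1f MB/s\n", throughputMBps(65536, 133505))
}
```

Dividing the two timings (133505 / 16245) gives the AVX-512 routine roughly an 8x advantage over pure Go at this input size.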

## Shortcomings

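There is no assembly routine for single-block compressions, so small inputs
gain nothing from AVX2 or AVX-512. This is most noticeable for ~1KB inputs:
`BenchmarkSum256/1024` reports the same 459.36 MB/s in all three tables above.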

Each assembly routine inlines all 7 rounds, causing thousands of lines of
duplicated code. Ideally the routines could be merged such that only a single
routine is generated for AVX-512 and AVX2, without sacrificing too much
performance.