blake3

A pure-Go implementation of the BLAKE3 cryptographic hash function

This repository has been archived on 2023-05-01. You can view files and clone it, but cannot push or open issues or pull requests.

lukechampine bb7ece4161 upgrade to avo@v0.4.0 (AVX-512 support, woo!)		2021-11-12 23:09:50 -05:00
avo	upgrade to avo@v0.4.0 (AVX-512 support, woo!)	2021-11-12 23:09:50 -05:00
testdata	add AVX512 implementations	2020-08-09 18:14:48 -04:00
LICENSE	initial commit	2020-01-09 15:10:01 -05:00
README.md	fix typo	2020-08-10 18:11:31 -04:00
blake3.go	add AVX512 implementations	2020-08-09 18:14:48 -04:00
blake3_amd64.s	upgrade to avo@v0.4.0 (AVX-512 support, woo!)	2021-11-12 23:09:50 -05:00
blake3_test.go	add AVX512 implementations	2020-08-09 18:14:48 -04:00
compress_amd64.go	add AVX512 implementations	2020-08-09 18:14:48 -04:00
compress_generic.go	add AVX512 implementations	2020-08-09 18:14:48 -04:00
compress_noasm.go	add AVX512 implementations	2020-08-09 18:14:48 -04:00
cpu.go	upgrade to cpuid/v2	2021-09-06 12:38:15 -04:00
cpu_darwin.go	upgrade to cpuid/v2	2021-09-06 12:38:15 -04:00
go.mod	upgrade to cpuid/v2	2021-09-06 12:38:15 -04:00
go.sum	upgrade to cpuid/v2	2021-09-06 12:38:15 -04:00

README.md

blake3

go get lukechampine.com/blake3

blake3 implements the BLAKE3 cryptographic hash function. This implementation aims to be performant without sacrificing (too much) readability, in the hopes of eventually landing in x/crypto.

In addition to the pure-Go implementation, this package also contains AVX-512 and AVX2 routines (generated by avo) that greatly increase performance for large inputs and outputs.
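
As a rough illustration of how a package with several code paths might choose between them at startup, the sketch below picks the widest available instruction set and falls back to pure Go. The `cpuFeatures` type, `implementation` type, and `selectImplementation` function are hypothetical names made up for this example and are not this package's API; real code would fill in the feature flags from a CPU detection library.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for real CPU feature detection.
type cpuFeatures struct {
	AVX512 bool
	AVX2   bool
}

// implementation describes one code path. lanes is the number of 32-bit
// words that fit in one vector register for that instruction set
// (512/32 = 16, 256/32 = 8, and 1 for scalar code).
type implementation struct {
	name  string
	lanes int
}

// selectImplementation prefers the widest instruction set that is available
// and falls back to the portable scalar code otherwise.
func selectImplementation(f cpuFeatures) implementation {
	switch {
	case f.AVX512:
		return implementation{"AVX-512", 16}
	case f.AVX2:
		return implementation{"AVX2", 8}
	default:
		return implementation{"pure Go", 1}
	}
}

func main() {
	for _, f := range []cpuFeatures{{true, true}, {false, true}, {}} {
		impl := selectImplementation(f)
		fmt.Printf("%+v -> %s (%d lanes)\n", f, impl.name, impl.lanes)
	}
}
```

Doing the selection once, rather than on every call, keeps the check out of the hot path.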

Contributions are greatly appreciated. All contributors are eligible to receive an Urbit planet.

Benchmarks

Tested on a 2020 MacBook Air (i5-7600K @ 3.80GHz). Benchmarks will improve as soon as I get access to a beefier AVX-512 machine. 😉

AVX-512

BenchmarkSum256/64           120 ns/op       533.00 MB/s
BenchmarkSum256/1024        2229 ns/op       459.36 MB/s
BenchmarkSum256/65536      16245 ns/op      4034.11 MB/s
BenchmarkWrite               245 ns/op      4177.38 MB/s
BenchmarkXOF                 246 ns/op      4159.30 MB/s

AVX2

BenchmarkSum256/64           120 ns/op       533.00 MB/s
BenchmarkSum256/1024        2229 ns/op       459.36 MB/s
BenchmarkSum256/65536      31137 ns/op      2104.76 MB/s
BenchmarkWrite               487 ns/op      2103.12 MB/s
BenchmarkXOF                 329 ns/op      3111.27 MB/s

Pure Go

BenchmarkSum256/64           120 ns/op       533.00 MB/s
BenchmarkSum256/1024        2229 ns/op       459.36 MB/s
BenchmarkSum256/65536     133505 ns/op       490.89 MB/s
BenchmarkWrite              2022 ns/op       506.36 MB/s
BenchmarkXOF                1914 ns/op       534.98 MB/s
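
The MB/s column follows from the ns/op column and the number of bytes processed per operation. The sketch below redoes that arithmetic for a few of the Sum256 rows above, taking 1 MB as 10^6 bytes. The `throughputMBps` helper is a name introduced for this example. Results differ slightly from the table because the ns/op values shown are rounded.

```go
package main

import "fmt"

// throughputMBps converts a benchmark result into MB/s (1 MB = 1e6 bytes).
// bytes/ns equals GB/s, so multiplying by 1e3 yields MB/s.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	return float64(bytesPerOp) / nsPerOp * 1e3
}

func main() {
	rows := []struct {
		name  string
		bytes int
		ns    float64
	}{
		{"Sum256/64", 64, 120},
		{"Sum256/65536 (AVX-512)", 65536, 16245},
		{"Sum256/65536 (AVX2)", 65536, 31137},
		{"Sum256/65536 (pure Go)", 65536, 133505},
	}
	for _, r := range rows {
		fmt.Printf("%-24s %8.2f MB/s\n", r.name, throughputMBps(r.bytes, r.ns))
	}
}
```

The same arithmetic suggests that BenchmarkWrite and BenchmarkXOF process 1024 bytes per operation, although that is an inference from the numbers rather than something stated here.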

Shortcomings

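There is no assembly routine for single-block compressions. This is most noticeable for ~1KB inputs: in the benchmarks above, the Sum256/64 and Sum256/1024 results are identical across the AVX-512, AVX2, and pure-Go builds.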

Each assembly routine inlines all 7 rounds, causing thousands of lines of duplicated code. Ideally the routines could be merged such that only a single routine is generated for AVX-512 and AVX2, without sacrificing too much performance.
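
For reference, the structure being inlined looks roughly like the following when written as a loop. This is a scalar Go sketch of the seven BLAKE3 rounds with the message permutation applied between them. It is not the avo generator code or the assembly in this repository, and the function names (`g`, `round`, `permute`, `sevenRounds`) are chosen for this example. A looped version emits the round body once, whereas inlining emits it seven times.

```go
package main

import (
	"fmt"
	"math/bits"
)

// msgPermutation is the BLAKE3 message word permutation applied between rounds.
var msgPermutation = [16]int{2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8}

// g is the BLAKE3 mixing function. RotateLeft32 with a negative count
// rotates right.
func g(v *[16]uint32, a, b, c, d int, mx, my uint32) {
	v[a] += v[b] + mx
	v[d] = bits.RotateLeft32(v[d]^v[a], -16)
	v[c] += v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -12)
	v[a] += v[b] + my
	v[d] = bits.RotateLeft32(v[d]^v[a], -8)
	v[c] += v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -7)
}

// round mixes the four columns of the state, then the four diagonals.
func round(v, m *[16]uint32) {
	g(v, 0, 4, 8, 12, m[0], m[1])
	g(v, 1, 5, 9, 13, m[2], m[3])
	g(v, 2, 6, 10, 14, m[4], m[5])
	g(v, 3, 7, 11, 15, m[6], m[7])
	g(v, 0, 5, 10, 15, m[8], m[9])
	g(v, 1, 6, 11, 12, m[10], m[11])
	g(v, 2, 7, 8, 13, m[12], m[13])
	g(v, 3, 4, 9, 14, m[14], m[15])
}

// permute reorders the message words in place for the next round.
func permute(m *[16]uint32) {
	var p [16]uint32
	for i, src := range msgPermutation {
		p[i] = m[src]
	}
	*m = p
}

// sevenRounds runs all 7 rounds as a loop instead of inlining each one.
// m is passed by value so the caller's message block is left untouched.
func sevenRounds(v *[16]uint32, m [16]uint32) {
	for r := 0; r < 7; r++ {
		if r > 0 {
			permute(&m)
		}
		round(v, &m)
	}
}

func main() {
	var state, msg [16]uint32 // zero state and a counting message, for demonstration only
	for i := range msg {
		msg[i] = uint32(i)
	}
	sevenRounds(&state, msg)
	fmt.Printf("%08x\n", state)
}
```

In vectorized code the trade-off is less clear-cut, since a loop adds branch and permutation overhead to a very hot path; this is presumably the performance cost the paragraph above refers to.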