Character encoding detection tool for NodeJS

zevanty 0c775fa651 Add option to return all matches		2018-04-22 00:57:54 -07:00
encoding	v0.1.0	2015-11-30 08:51:34 +11:00
scripts	Fix release script	2017-10-16 13:40:51 +11:00
test	Add option to return all matches	2018-04-22 00:57:54 -07:00
.gitignore	v0.1.0	2015-11-30 08:51:34 +11:00
.npmignore	Add `yarn.lock` to `.npmignore`.	2018-01-25 19:49:49 -07:00
.travis.yml	Upgrade deps, yarn lock	2017-10-16 11:52:35 +11:00
LICENSE	Upgrade deps, yarn lock	2017-10-16 11:52:35 +11:00
README.md	Add option to return all matches	2018-04-22 00:57:54 -07:00
index.js	Add option to return all matches	2018-04-22 00:57:54 -07:00
match.js	major style changes for rest of the files	2013-11-22 15:37:41 +11:00
package.json	0.4.2	2017-11-27 08:03:17 +11:00
yarn.lock	add missing github-publish-release to dev deps	2017-10-16 13:41:50 +11:00

README.md

chardet

Chardet is a character encoding detection module for Node.js written in pure JavaScript. The module is based on the ICU project (http://site.icu-project.org/), which uses character occurrence analysis to determine the most probable encoding.

Installation

npm i chardet

Usage

var chardet = require('chardet');
chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file', function(err, encoding) {});
// or
chardet.detectFileSync('/path/to/file');

Working with large data sets

When the data set is huge and you want to optimize performance (at the cost of some accuracy), you can sample only the first N bytes of the buffer:

chardet.detectFile('/path/to/file', { sampleSize: 32 }, function(err, encoding) {});
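To illustrate what sampling the first N bytes means, here is a minimal sketch that reads only the start of a file using Node's `fs` module. The helper `readSample` and the temporary file name are hypothetical and are not part of chardet; this is not necessarily how the module implements `sampleSize` internally.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of chardet): read at most `sampleSize`
// bytes from the start of a file instead of loading the whole file.
function readSample(filePath, sampleSize) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const buffer = Buffer.alloc(sampleSize);
    const bytesRead = fs.readSync(fd, buffer, 0, sampleSize, 0);
    // Trim the buffer when the file is shorter than the requested sample.
    return buffer.subarray(0, bytesRead);
  } finally {
    fs.closeSync(fd);
  }
}

// Usage: write a small temporary file and sample its first 32 bytes.
const tmpFile = path.join(os.tmpdir(), 'chardet-sample-' + process.pid + '.txt');
fs.writeFileSync(tmpFile, 'hello there! '.repeat(100));
const sample = readSample(tmpFile, 32);
console.log(sample.length);
fs.unlinkSync(tmpFile);
```

A buffer obtained this way could then be passed to `chardet.detect`. A smaller sample is cheaper to read and analyze, but it gives the detector fewer bytes to judge from.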

Returning more detailed results

If you wish to see the full list of possible encodings:

chardet.detectFile('/path/to/file', { returnAllMatches: true }, function(err, encodings) {
  // encodings is an array of objects sorted by confidence value in descending order
  // e.g. [{ confidence: 90, name: 'UTF-8' }, { confidence: 20, name: 'windows-1252', lang: 'fr' }]
});
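One way to use the full match list is to accept the top match only when its confidence is high enough, and fall back to a default otherwise. The sketch below assumes the result shape shown in the example above; `pickEncoding`, the threshold values, and the fallback encoding are hypothetical and not part of chardet.

```javascript
// Hypothetical helper (not part of chardet): choose an encoding name from
// a match list shaped like [{ confidence, name, lang? }, ...].
function pickEncoding(matches, minConfidence, fallback) {
  if (!Array.isArray(matches) || matches.length === 0) {
    return fallback;
  }
  // The list is assumed to be sorted by confidence, highest first.
  const best = matches[0];
  return best.confidence >= minConfidence ? best.name : fallback;
}

// Sample data copied from the example above.
const matches = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];

console.log(pickEncoding(matches, 50, 'ISO-8859-1')); // prints "UTF-8"
console.log(pickEncoding(matches, 95, 'ISO-8859-1')); // prints "ISO-8859-1"
```

The right threshold depends on the data; short or mostly ASCII inputs tend to give the detector little to distinguish between candidates.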

Supported Encodings:

UTF-8
UTF-16 LE
UTF-16 BE
UTF-32 LE
UTF-32 BE
ISO-2022-JP
ISO-2022-KR
ISO-2022-CN
Shift-JIS
Big5
EUC-JP
EUC-KR
GB18030
ISO-8859-1
ISO-8859-2
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
KOI8-R

Currently only these encodings are supported; more will be added soon.