Character encoding detection tool for NodeJS

chardet

Chardet is a character encoding detection module for Node.js written in pure JavaScript. The module is based on the ICU project (http://site.icu-project.org/), which uses character occurrence analysis to determine the most probable encoding.

Installation

npm i chardet

Usage

To return the encoding with the highest confidence:

const chardet = require('chardet');

chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file').then(encoding => console.log(encoding));
// or
chardet.detectFileSync('/path/to/file');

To return the full list of possible encodings, use the analyse method:

const chardet = require('chardet');
chardet.analyse(Buffer.from('hello there!'));

The returned value is an array of objects sorted by confidence in descending order:

[
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];
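
A caller may want to ignore low-confidence guesses rather than always take the top entry. The sketch below works on an array shaped like the one above. Note that `pickEncoding` and its default threshold of 50 are illustrative assumptions, not part of chardet:

```javascript
// Hypothetical helper, not part of chardet: return the name of the first
// match whose confidence meets a threshold, or null if none qualifies.
function pickEncoding(matches, minConfidence = 50) {
  // The array is documented as sorted by confidence, highest first,
  // so the first qualifying entry is also the best one.
  const best = matches.find(m => m.confidence >= minConfidence);
  return best ? best.name : null;
}

// Sample data copied from the result shown above.
const matches = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];

console.log(pickEncoding(matches));     // UTF-8
console.log(pickEncoding(matches, 95)); // null
```

Returning null instead of a low-confidence name lets the caller decide on a fallback explicitly.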

Working with large data sets

Sometimes, when the data set is huge and you want to optimize performance (at the cost of some accuracy), you can sample only the first N bytes of the buffer:

chardet
  .detectFile('/path/to/file', { sampleSize: 32 })
  .then(encoding => console.log(encoding));
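
To illustrate what sampling the first N bytes means, here is a minimal sketch that uses only Node's built-in `fs` module. `readSample` is a hypothetical helper and is not how chardet itself is implemented; it only shows the idea of reading a prefix of a file instead of the whole file. The buffer it returns could then be passed to `chardet.detect`.

```javascript
const fs = require('fs');

// Hypothetical helper, not part of chardet: read at most `sampleSize`
// bytes from the start of a file without loading the whole file.
function readSample(path, sampleSize) {
  const fd = fs.openSync(path, 'r');
  try {
    const sample = Buffer.alloc(sampleSize);
    // readSync returns the number of bytes actually read, which is
    // smaller than sampleSize when the file is shorter than that.
    const bytesRead = fs.readSync(fd, sample, 0, sampleSize, 0);
    return sample.subarray(0, bytesRead);
  } finally {
    // Always release the file descriptor, even if the read throws.
    fs.closeSync(fd);
  }
}
```

A smaller sample is cheaper to read and analyse, but gives the detector less evidence to work with, which is the accuracy tradeoff mentioned above.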

Supported Encodings:

UTF-8
UTF-16 LE
UTF-16 BE
UTF-32 LE
UTF-32 BE
ISO-2022-JP
ISO-2022-KR
ISO-2022-CN
Shift_JIS
Big5
EUC-JP
EUC-KR
GB18030
ISO-8859-1
ISO-8859-2
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
KOI8-R

Currently only these encodings are supported.
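
Once an encoding name has been detected, a common next step is decoding the bytes. The following is a sketch under stated assumptions: `decodeWith` is a hypothetical helper, not part of chardet, and it relies on the `TextDecoder` global that ships with Node.js. Not every name in the list above is necessarily a label that `TextDecoder` accepts, and support for legacy encodings depends on how Node.js was built, so the helper returns null when the label is rejected.

```javascript
// Hypothetical helper, not part of chardet: decode bytes using a detected
// encoding name, or return null if TextDecoder does not support it.
function decodeWith(name, bytes) {
  // Normalise names such as 'UTF-16 LE' to the label form 'utf-16le'.
  const label = name.toLowerCase().replace(/\s+/g, '');
  try {
    return new TextDecoder(label).decode(bytes);
  } catch (err) {
    // TextDecoder throws a RangeError for labels it does not support.
    if (err instanceof RangeError) return null;
    throw err;
  }
}

console.log(decodeWith('UTF-8', Buffer.from('hello there!'))); // hello there!
```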

TypeScript?

Yes. Type definitions are included.