Character encoding detection tool for NodeJS

chardet

Chardet is a character encoding detection module for NodeJS written in pure JavaScript. The module is based on the ICU project (http://site.icu-project.org/), which uses character occurrence analysis to determine the most probable encoding.
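
To give a feel for what occurrence analysis means, here is a toy sketch. It is not ICU's or chardet's algorithm, and the function name `toyCyrillicGuess` is made up for illustration. It relies on one known fact: lowercase Cyrillic letters sit at bytes 0xE0-0xFF in windows-1251 and at 0xC0-0xDF in KOI8-R, so in mostly lowercase Russian text the dominant byte range hints at the encoding.

```javascript
// Toy illustration only: real detectors use far richer statistics.
// Counts high bytes in the two ranges where lowercase Cyrillic letters
// live in windows-1251 (0xE0-0xFF) and KOI8-R (0xC0-0xDF).
function toyCyrillicGuess(buffer) {
  let lowRange = 0;  // bytes in 0xC0-0xDF
  let highRange = 0; // bytes in 0xE0-0xFF
  for (const byte of buffer) {
    if (byte >= 0xc0 && byte <= 0xdf) lowRange++;
    else if (byte >= 0xe0) highRange++;
  }
  const total = lowRange + highRange;
  if (total === 0) return []; // no evidence either way

  // Shape the result like the detectAll output shown in the Usage section.
  return [
    { name: 'windows-1251', confidence: Math.round((100 * highRange) / total) },
    { name: 'KOI8-R', confidence: Math.round((100 * lowRange) / total) },
  ].sort((a, b) => b.confidence - a.confidence);
}

// "привет" encoded as windows-1251 and as KOI8-R
const asWindows1251 = Buffer.from([0xef, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);
const asKoi8r = Buffer.from([0xd0, 0xd2, 0xc9, 0xd7, 0xc5, 0xd4]);

console.log(toyCyrillicGuess(asWindows1251)[0].name); // windows-1251
console.log(toyCyrillicGuess(asKoi8r)[0].name);       // KOI8-R
```

Uppercase-heavy or very short input would fool this heuristic, which is why a real detector reports a confidence value instead of a single certain answer.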

Installation

npm i chardet

Usage

To return the encoding with the highest confidence:

var chardet = require('chardet');
chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file', function(err, encoding) {});
// or
chardet.detectFileSync('/path/to/file');
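
If you prefer promises to callbacks, the callback form above can be wrapped with Node's `util.promisify`. This is a minimal sketch, not part of chardet's API: the helper name `promisifyDetectFile` is hypothetical, and it assumes the wrapped function takes a Node-style `(err, result)` callback as its last argument and does not depend on `this`.

```javascript
const util = require('util');

// Hypothetical helper: the detector is passed in, so the sketch does not
// depend on how (or whether) chardet is installed.
function promisifyDetectFile(detectFile) {
  // util.promisify appends the callback as the final argument, so both
  // (path) and (path, opts) call shapes keep working.
  return util.promisify(detectFile);
}

// Usage, assuming chardet is installed:
//   const chardet = require('chardet');
//   const detectFileAsync = promisifyDetectFile(chardet.detectFile);
//   detectFileAsync('/path/to/file').then(console.log, console.error);
```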

To return the full list of possible encodings:

var chardet = require('chardet');
chardet.detectAll(Buffer.from('hello there!'));
// or
chardet.detectFileAll('/path/to/file', function(err, encodings) {});
// or
chardet.detectFileAllSync('/path/to/file');

// The returned value is an array of objects sorted by confidence in descending order,
// e.g. [{ confidence: 90, name: 'UTF-8' }, { confidence: 20, name: 'windows-1252', lang: 'fr' }]
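
One way to use the full list is to accept the top match only when its confidence clears a threshold and otherwise fall back to a default. The sketch below works on any array of the shape shown above. The helper name `pickEncoding`, the threshold, and the fallback value are illustrative assumptions, not part of chardet.

```javascript
// Hypothetical helper: choose an encoding name from a detectAll-style result.
// It scans for the highest confidence itself, so it does not rely on the
// input being pre-sorted.
function pickEncoding(matches, minConfidence, fallback) {
  let best = null;
  for (const match of matches) {
    if (best === null || match.confidence > best.confidence) {
      best = match;
    }
  }
  // Use the fallback when there are no matches or the best one is too weak.
  return best !== null && best.confidence >= minConfidence ? best.name : fallback;
}

const sample = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' },
];

console.log(pickEncoding(sample, 50, 'ISO-8859-1')); // UTF-8
console.log(pickEncoding(sample, 95, 'ISO-8859-1')); // ISO-8859-1
```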

Working with large data sets

Sometimes, when the data set is huge and you want to optimize performance (at the cost of some accuracy), you can sample only the first N bytes of the buffer:

chardet.detectFile('/path/to/file', { sampleSize: 32 }, function(err, encoding) {});
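
To illustrate the idea behind sampling, the sketch below reads only the first N bytes of a file with Node's `fs` module. It is an assumption about the general technique, not a copy of chardet's internals, and the helper name `readSample` is hypothetical. The resulting buffer could then be handed to `chardet.detect`.

```javascript
const fs = require('fs');

// Hypothetical helper: read at most `sampleSize` bytes from the start of a file.
function readSample(filePath, sampleSize) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const buffer = Buffer.alloc(sampleSize);
    // Read from file position 0; bytesRead is smaller than sampleSize
    // when the file itself is shorter.
    const bytesRead = fs.readSync(fd, buffer, 0, sampleSize, 0);
    return buffer.subarray(0, bytesRead);
  } finally {
    fs.closeSync(fd); // always release the file descriptor
  }
}

// Usage, assuming chardet is installed:
//   const chardet = require('chardet');
//   chardet.detect(readSample('/path/to/file', 32));
```

A small sample keeps I/O and analysis cheap, but it can miss multi-byte sequences that appear only later in the file, which is the accuracy tradeoff mentioned above.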

Supported Encodings:

  • UTF-8
  • UTF-16 LE
  • UTF-16 BE
  • UTF-32 LE
  • UTF-32 BE
  • ISO-2022-JP
  • ISO-2022-KR
  • ISO-2022-CN
  • Shift-JIS
  • Big5
  • EUC-JP
  • EUC-KR
  • GB18030
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • windows-1250
  • windows-1251
  • windows-1252
  • windows-1253
  • windows-1254
  • windows-1255
  • windows-1256
  • KOI8-R

Currently only these encodings are supported; more will be added soon.