Character encoding detection tool for NodeJS
Dmitry Shirokov a6fca620c8 Merge pull request #27 from runk/typescript
BREAKING CHANGE: Repo overhaul

- Deprecation of callbacks in favour of promises
- Deprecation of `detectAll`, `detectFileAll` and `detectFileAllSync` - use `analyse` fn instead.
- Typescript typings now included as part of distribution
- Modules support
- Travis CI org => com migration
- Lazy loading of `fs` module to enable usage in browser
2020-03-31 21:22:35 +11:00

README.md

chardet

Chardet is a character encoding detection module for Node.js, written in pure JavaScript. The module is based on the ICU project (http://site.icu-project.org/), which uses character occurrence analysis to determine the most probable encoding.

Installation

npm i chardet

Usage

To return the encoding with the highest confidence:

const chardet = require('chardet');

chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file').then(encoding => console.log(encoding));
// or
chardet.detectFileSync('/path/to/file');

To return the full list of possible encodings, use the analyse method:

const chardet = require('chardet');
chardet.analyse(Buffer.from('hello there!'));

The returned value is an array of objects sorted by confidence in descending order:

[
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];
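
As a rough sketch of how this array might be consumed, the helper below keeps only the candidates at or above a confidence threshold. The function name `filterByConfidence` and the threshold of 50 are illustrative assumptions, not part of chardet; the sample data simply mirrors the output shown above.

```javascript
// Hypothetical helper (not part of chardet): keep candidates whose
// confidence is at or above the given threshold.
function filterByConfidence(matches, minConfidence) {
  return matches.filter(match => match.confidence >= minConfidence);
}

// Sample data in the same shape as the analyse output shown above.
const matches = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];

const likely = filterByConfidence(matches, 50);
console.log(likely.map(match => match.name)); // [ 'UTF-8' ]
```

Because the array is already sorted in descending order, the first element of the filtered result is still the most probable encoding.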

Working with large data sets

Sometimes, when the data set is huge and you want to optimize performance (at the cost of some accuracy), you can sample only the first N bytes of the buffer:

chardet
  .detectFile('/path/to/file', { sampleSize: 32 })
  .then(encoding => console.log(encoding));
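
The same idea can be applied by hand to a buffer that is already in memory. The sketch below truncates a buffer to its first N bytes with Node's `Buffer.subarray`; the helper name `sampleBuffer` is an assumption made for illustration, and this is not necessarily how chardet implements `sampleSize` internally.

```javascript
// Hypothetical helper (not part of chardet): return a view of at most
// the first `sampleSize` bytes of a buffer, without copying the data.
function sampleBuffer(buffer, sampleSize) {
  return buffer.length > sampleSize ? buffer.subarray(0, sampleSize) : buffer;
}

const data = Buffer.from('hello there!'.repeat(100)); // 1200 bytes
const sample = sampleBuffer(data, 32);
console.log(sample.length); // 32
```

The resulting sample could then be passed to `chardet.detect` in place of the full buffer.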

Supported Encodings:

  • UTF-8
  • UTF-16 LE
  • UTF-16 BE
  • UTF-32 LE
  • UTF-32 BE
  • ISO-2022-JP
  • ISO-2022-KR
  • ISO-2022-CN
  • Shift-JIS
  • Big5
  • EUC-JP
  • EUC-KR
  • GB18030
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • windows-1250
  • windows-1251
  • windows-1252
  • windows-1253
  • windows-1254
  • windows-1255
  • windows-1256
  • KOI8-R

Currently only these encodings are supported.

TypeScript?

Yes. Type definitions are included.