Character encoding detection tool for NodeJS

chardet

Chardet is a character encoding detection module written in pure JavaScript (TypeScript). The module uses occurrence analysis to determine the most probable encoding.

Packed size is only 22 KB
Works in all environments: Node / Browser / Native
Works on all platforms: Linux / Mac / Windows
No dependencies
No native code / bindings
100% written in TypeScript
Extensive code coverage

Installation

npm i chardet

Usage

To return the encoding with the highest confidence:

import chardet from 'chardet';

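// From an in-memory buffer
const encoding = chardet.detect(Buffer.from('hello there!'));
// or, asynchronously, from a file
const encodingFromFile = await chardet.detectFile('/path/to/file');
// or, synchronously, from a file
const encodingFromFileSync = chardet.detectFileSync('/path/to/file');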

To return the full list of possible encodings, use the analyse method:

import chardet from 'chardet';
chardet.analyse(Buffer.from('hello there!'));

The returned value is an array of objects sorted by confidence in descending order:

[
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];
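
Since the list is sorted by confidence, a caller may want to keep only the plausible candidates. The following is a minimal sketch of that idea, not part of chardet itself: the `Match` interface is a local type that mirrors the shape shown above, `filterByConfidence` is a hypothetical helper, and the threshold of 50 is an arbitrary choice for illustration.

```typescript
// Local type mirroring the result shape shown above
// (an assumption for this sketch, not imported from chardet).
interface Match {
  confidence: number;
  name: string;
  lang?: string;
}

// Hypothetical helper: keep only candidates at or above a minimum
// confidence, preserving the original (descending) order.
function filterByConfidence(matches: Match[], minConfidence: number): Match[] {
  return matches.filter((m) => m.confidence >= minConfidence);
}

// Sample data copied from the example output above.
const sample: Match[] = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' },
];

// 50 is an arbitrary cut-off chosen for illustration.
const likely = filterByConfidence(sample, 50);
console.log(likely.map((m) => m.name));
```

In real use, the array passed in would presumably be the value returned by `chardet.analyse(...)`; a suitable threshold depends on the data.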

Working with large data sets

When the data set is large and you want to trade some accuracy for speed, you can sample only the first N bytes of the buffer:

chardet
  .detectFile('/path/to/file', { sampleSize: 32 })
  .then(encoding => console.log(encoding));

You can also specify where to begin reading from in the buffer:

chardet
  .detectFile('/path/to/file', { sampleSize: 32, offset: 128 })
  .then(encoding => console.log(encoding));
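
To make the two options concrete, the sketch below shows which bytes a sample of `sampleSize` bytes starting at `offset` would cover for data already in memory. It is an illustration under assumptions, not chardet's implementation: `SampleOptions` and `sampleWindow` are hypothetical names, and how the library actually reads from a file is not shown here.

```typescript
// Hypothetical options type for this sketch; it mirrors the option
// names used in the examples above.
interface SampleOptions {
  sampleSize?: number;
  offset?: number;
}

// Return the window of bytes that would be inspected: start at `offset`
// and take at most `sampleSize` bytes. Without `sampleSize`, everything
// from `offset` to the end is kept. Both bounds are clamped to the data.
function sampleWindow(data: Uint8Array, opts: SampleOptions = {}): Uint8Array {
  const start = Math.min(opts.offset ?? 0, data.length);
  const end =
    opts.sampleSize === undefined
      ? data.length
      : Math.min(start + opts.sampleSize, data.length);
  return data.subarray(start, end);
}

const bytes = new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);
const windowed = sampleWindow(bytes, { sampleSize: 3, offset: 2 });
console.log(Array.from(windowed));
```

A smaller window means less data to analyse, which is where the speed gain and the loss of accuracy both come from.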

Supported Encodings:

UTF-8
UTF-16 LE
UTF-16 BE
UTF-32 LE
UTF-32 BE
ISO-2022-JP
ISO-2022-KR
ISO-2022-CN
Shift_JIS
Big5
EUC-JP
EUC-KR
GB18030
ISO-8859-1
ISO-8859-2
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
KOI8-R

Currently only these encodings are supported.

TypeScript?

Yes. Type definitions are included.

References

ICU project http://site.icu-project.org/