# chardet [![Build Status](https://travis-ci.org/runk/node-chardet.png)](https://travis-ci.org/runk/node-chardet)
Chardet is a character encoding detection module for Node.js, written in pure JavaScript.
It is based on the [ICU project](http://site.icu-project.org/), which uses character
occurrence analysis to determine the most probable encoding.
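
To give a rough feel for what byte-pattern analysis can look like, the sketch below scores how plausible it is that a buffer holds UTF-8 by counting well-formed and malformed multi-byte sequences. It is a simplified illustration only, not the code that chardet or ICU actually runs; the function name `utf8Plausibility` and the score values are assumptions made for this example.

```javascript
// Toy illustration, NOT chardet's implementation.
// Counts well-formed vs. malformed multi-byte UTF-8 sequences in a buffer.
function utf8Plausibility(buf) {
  let valid = 0;
  let invalid = 0;

  for (let i = 0; i < buf.length; i++) {
    const b = buf[i];
    if ((b & 0x80) === 0) continue; // plain ASCII byte, carries no signal

    // Number of continuation bytes expected after this lead byte.
    let trail;
    if ((b & 0xe0) === 0xc0) trail = 1;
    else if ((b & 0xf0) === 0xe0) trail = 2;
    else if ((b & 0xf8) === 0xf0) trail = 3;
    else {
      invalid++; // stray continuation byte or illegal lead byte
      continue;
    }

    // Peek ahead: every expected trail byte must look like 10xxxxxx.
    let ok = true;
    for (let k = 1; k <= trail; k++) {
      if (i + k >= buf.length || (buf[i + k] & 0xc0) !== 0x80) {
        ok = false;
        break;
      }
    }

    if (ok) {
      valid++;
      i += trail; // skip the continuation bytes just checked
    } else {
      invalid++;
    }
  }

  // Hypothetical scoring, chosen only for this example.
  if (valid === 0 && invalid === 0) return 10; // pure ASCII: weak evidence
  if (invalid === 0) return 100;
  if (valid > invalid * 10) return 25;
  return 0;
}
```

With these rules, a buffer of plain ASCII scores 10, well-formed UTF-8 text with accented characters scores 100, and a lone Latin-1 `0xE9` byte between ASCII letters scores 0.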
## Installation
```
npm i chardet
```
## Usage
To return the encoding with the highest confidence:
```javascript
const chardet = require('chardet');

chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file').then(encoding => console.log(encoding));
// or
chardet.detectFileSync('/path/to/file');
```
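Once an encoding name is known, a common next step is decoding the bytes into a string. The sketch below is one way to do that with Node's built-in `TextDecoder`. `decodeWithDetected` is a hypothetical helper, not part of chardet, and it takes the detector as a parameter (for example `chardet.detect`). Whether `TextDecoder` accepts a given label depends on the ICU support of the Node.js build, so treat that as an assumption to verify for the encodings you care about.

```javascript
// Hypothetical helper, not part of chardet.
// `detect` is any function that maps a Buffer to an encoding name
// (or a falsy value when nothing matched), for example chardet.detect.
function decodeWithDetected(buffer, detect, fallback = 'utf-8') {
  const encoding = detect(buffer) || fallback;
  try {
    // TextDecoder throws a RangeError for labels it does not recognise.
    return new TextDecoder(encoding).decode(buffer);
  } catch (err) {
    if (err instanceof RangeError) {
      return new TextDecoder(fallback).decode(buffer);
    }
    throw err;
  }
}
```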
To return the full list of possible encodings, use the `analyse` method:
```javascript
const chardet = require('chardet');

chardet.analyse(Buffer.from('hello there!'));
```
The returned value is an array of objects sorted by confidence in descending order:
```javascript
[
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];
```
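Because the result is a plain array, it can be post-processed with ordinary array methods. As a small sketch, the hypothetical helper `pickEncoding` below (not part of chardet) returns the top match only if its confidence clears a threshold, and otherwise falls back to a default. The threshold of 50 and the `UTF-8` fallback are arbitrary assumptions.

```javascript
// Hypothetical helper, not part of chardet.
// `matches` has the shape shown above: [{ confidence, name, lang? }, ...]
function pickEncoding(matches, minConfidence = 50, fallback = 'UTF-8') {
  // Sort a copy defensively so the caller's array is left untouched.
  const sorted = [...matches].sort((a, b) => b.confidence - a.confidence);
  const best = sorted[0];
  return best && best.confidence >= minConfidence ? best.name : fallback;
}

const sample = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];
console.log(pickEncoding(sample)); // prints "UTF-8"
```

An empty array, or a best match below the threshold, yields the fallback instead.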
## Working with large data sets
Sometimes, when the data set is huge and you want to optimize performance (at the cost of some accuracy),
you can sample only the first N bytes of the buffer:
```javascript
chardet
  .detectFile('/path/to/file', { sampleSize: 32 })
  .then(encoding => console.log(encoding));
```
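The same idea can be applied by hand when the data is already in memory: slice off a prefix and hand only that to the detector. The sketch below assumes nothing about how `sampleSize` is implemented inside chardet. `detectFromSample` is a hypothetical helper, and `detect` stands in for a detector such as `chardet.detect`.

```javascript
// Hypothetical helper, not part of chardet.
// `detect` is any function that accepts a Buffer, for example chardet.detect.
function detectFromSample(buffer, detect, sampleSize = 1024) {
  // subarray() returns a view on the same memory, so nothing is copied.
  // If the buffer is shorter than sampleSize, the whole buffer is used.
  const sample = buffer.subarray(0, sampleSize);
  return detect(sample);
}
```

A fixed-size cut can land in the middle of a multi-byte character, which is one reason a small sample may lower accuracy.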
## Supported Encodings
- UTF-8
- UTF-16 LE
- UTF-16 BE
- UTF-32 LE
- UTF-32 BE
- ISO-2022-JP
- ISO-2022-KR
- ISO-2022-CN
- Shift_JIS
- Big5
- EUC-JP
- EUC-KR
- GB18030
- ISO-8859-1
- ISO-8859-2
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- windows-1250
- windows-1251
- windows-1252
- windows-1253
- windows-1254
- windows-1255
- windows-1256
- KOI8-R

Currently only these encodings are supported.
## TypeScript?
Yes. Type definitions are included.