# chardet [![Build Status](https://travis-ci.org/runk/node-chardet.png)](https://travis-ci.org/runk/node-chardet)
Chardet is a character encoding detection module for Node.js written in pure JavaScript.
The module is based on the ICU project (http://site.icu-project.org/), which uses character
occurrence analysis to determine the most probable encoding.
## Installation
```
npm i chardet
```
## Usage
To return the encoding with the highest confidence:
```javascript
const chardet = require('chardet');
chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file').then(encoding => console.log(encoding));
// or
chardet.detectFileSync('/path/to/file');
```
To return the full list of possible encodings, use the `analyse` method:
```javascript
const chardet = require('chardet');
chardet.analyse(Buffer.from('hello there!'));
```
The returned value is an array of objects sorted by confidence in descending order:
```javascript
[
{ confidence: 90, name: 'UTF-8' },
{ confidence: 20, name: 'windows-1252', lang: 'fr' }
];
```
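
Because each entry carries its own confidence score, a caller may want to reject weak matches instead of always taking the first one. The following is a minimal sketch, assuming only the result shape shown above. `pickEncoding`, its default threshold of 50, and the `'UTF-8'` fallback are hypothetical choices for illustration and are not part of the chardet API.

```javascript
// Hypothetical helper: return the best match at or above a confidence threshold.
// `matches` has the shape shown above: [{ confidence, name, lang? }, ...].
function pickEncoding(matches, minConfidence = 50, fallback = 'UTF-8') {
  // Find the highest-confidence entry explicitly rather than relying on input order.
  let best = null;
  for (const match of matches) {
    if (best === null || match.confidence > best.confidence) {
      best = match;
    }
  }
  // Use the fallback when the list is empty or the best match is too weak.
  return best !== null && best.confidence >= minConfidence ? best.name : fallback;
}

const sample = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' },
];
console.log(pickEncoding(sample)); // UTF-8
```

In real use, the array passed to the helper would be the value returned by `chardet.analyse(...)`.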
## Working with large data sets
When the data set is huge and you want to optimize performance (at the cost of some accuracy),
you can sample only the first N bytes of the buffer:
```javascript
chardet
.detectFile('/path/to/file', { sampleSize: 32 })
.then(encoding => console.log(encoding));
```
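
The idea of sampling can also be illustrated on an in-memory buffer. This is a rough sketch of the concept only, not necessarily how the `sampleSize` option is implemented internally; `sampleHead` is a hypothetical helper name.

```javascript
// Hypothetical helper: keep only the first `sampleSize` bytes of a buffer.
function sampleHead(buffer, sampleSize) {
  // Without a positive integer size, analyse the whole buffer.
  if (!Number.isInteger(sampleSize) || sampleSize <= 0) {
    return buffer;
  }
  // subarray returns a view over the same memory, so no bytes are copied.
  return buffer.subarray(0, sampleSize);
}

const data = Buffer.from('hello there! '.repeat(1000));
const head = sampleHead(data, 32);
console.log(head.length); // 32
// The sample could then be passed to a detector, e.g. chardet.detect(head).
```

A small sample gives the detector fewer bytes to analyse and may cut a multi-byte character in half at the boundary, which is one reason accuracy can drop.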
## Supported Encodings
- UTF-8
- UTF-16 LE
- UTF-16 BE
- UTF-32 LE
- UTF-32 BE
- ISO-2022-JP
- ISO-2022-KR
- ISO-2022-CN
- Shift_JIS
- Big5
- EUC-JP
- EUC-KR
- GB18030
- ISO-8859-1
- ISO-8859-2
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- windows-1250
- windows-1251
- windows-1252
- windows-1253
- windows-1254
- windows-1255
- windows-1256
- KOI8-R
Currently only these encodings are supported.
## TypeScript?
Yes. Type definitions are included.