Merged
6 changes: 3 additions & 3 deletions content/docs/editing.md
@@ -234,17 +234,17 @@ This Virtual Space feature, however, does not influence blank space beyond the l

## Character Panel

The Character Panel, accessed through the **Edit > Character Panel** menu entry, allows the user to interact with the first 256 characters in the active [Encoding or Character set](../preferences/#encoding-menu).
The Character Panel, accessed through the **Edit > Character Panel** menu entry, allows the user to interact with the first 256 characters in the active [Encoding or Character set](../encoding/).

When opened, the Character Panel will be by default a docked window on the right-hand side of the Notepad++ main window, entitled `ASCII Codes Insertion Panel`. (This is a bit of a misnomer since ASCII is defined as values 0 - 127 and the panel shows values in the range 0 - 255.)

This panel contains a grid-like control that has six columns: `Value`, `Hex`, `Character`, `HTML Name`, `HTML Decimal` and `HTML Hexadecimal`. The HTML columns show the various HTML entity formats for each character: `HTML Name` is the named entity, like `&quot;`. `HTML Decimal` (or `HTML Code` in older versions) is the decimal entity, like `&#34;`. And `HTML Hexadecimal` (new to v8.5.2) is the hexadecimal entity, like `&#x22;`. (All three of those examples refer to the ASCII double quote `"` character.)

If input focus is moved to a line in the Character Panel and Enter is pressed, the character from the `Character` column will be inserted at the current position in the document being edited. If the mouse is used there is more flexibility: an item from the grid that is double-clicked will be inserted. For example, when double-clicking `&quot;` from the `HTML Name` column on the line of value 34, `&quot;` (as literal text) will be inserted at the current position in the active document -- so this can be used to insert the character number in decimal or hexadecimal, the character itself, or the HTML entity (named or decimal or hexadecimal) into the document being edited.
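As a rough illustration of the three HTML columns, the following Python sketch builds the named, decimal, and hexadecimal entity for a character value. The helper name `entity_forms` is hypothetical and this is not how Notepad++ builds the panel; it uses `chr()`, which maps a value to a Unicode code point, so it only lines up with the panel's `Character` column for the ASCII range 0-127.

```python
import html
from html.entities import codepoint2name

def entity_forms(value: int) -> dict:
    """Hypothetical helper: the entity spellings for one character value."""
    name = codepoint2name.get(value)  # None when HTML defines no named entity
    return {
        "character": chr(value),
        "html_name": f"&{name};" if name else "",
        "html_decimal": f"&#{value};",
        "html_hexadecimal": f"&#x{value:02X};",
    }

forms = entity_forms(34)  # the ASCII double quote
print(forms)

# All three entity spellings decode back to the same character.
decoded = {html.unescape(forms[key]) for key in
           ("html_name", "html_decimal", "html_hexadecimal")}
print(decoded)
```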

When Notepad++ is told to interpret a file as ANSI or any of the Character Sets (described in the [**Encoding**-menu docs](../preferences/#encoding-menu)), the Character Panel shows information about the 256 8-bit character numbers (that is, the `Value`) for the character set selected. Note that for all character sets, the 0-127 character values always represent the same character (the ASCII character); character values from 128-255 are character-set specific as to which character each value represents. The panel offers an easy way to see value and character equivalence, and insert characters that don't exist on your keyboard.
When Notepad++ is told to interpret a file as ANSI or any of the Character Sets (described in the [**Encoding**-menu docs](../encoding/)), the Character Panel shows information about the 256 8-bit character numbers (that is, the `Value`) for the character set selected. Note that for all character sets, the 0-127 character values always represent the same character (the ASCII character); character values from 128-255 are character-set specific as to which character each value represents. The panel offers an easy way to see value and character equivalence, and insert characters that don't exist on your keyboard.

When Notepad++ is told to interpret a file as Unicode (the entries starting with `UTF-8` or `UTF-16` in the [**Encoding** menu](../preferences/#encoding-menu)), the Character Panel will show the same characters for values 128-255 as the default codepage character set on the user's system (viewable as `Current ANSI codepage` in the **? > Debug Info** menu entry). In this case, for values 128-255, the `Value` column is meaningless -- only the `Character` shown is important. If this panel is used to insert the character into the active document, the correct Unicode character bytes will be used instead of the value as in the simpler ANSI case. For values from 0-127, the character/value pair _is_ meaningful, because for this range the ANSI set of characters -- the true ASCII set -- line up with the same characters and values in the Unicode set.
When Notepad++ is told to interpret a file as Unicode (the entries starting with `UTF-8` or `UTF-16` in the [**Encoding** menu](../encoding/)), the Character Panel will show the same characters for values 128-255 as the default codepage character set on the user's system (viewable as `Current ANSI codepage` in the **? > Debug Info** menu entry). In this case, for values 128-255, the `Value` column is meaningless -- only the `Character` shown is important. If this panel is used to insert the character into the active document, the correct Unicode character bytes will be used instead of the value as in the simpler ANSI case. For values from 0-127, the character/value pair _is_ meaningful, because for this range the ANSI set of characters -- the true ASCII set -- line up with the same characters and values in the Unicode set.
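The difference between the ANSI and Unicode cases can be checked outside Notepad++ with a short Python sketch. It uses Python's `cp1252` and `iso8859_2` codecs as stand-ins for two character sets; the value `0xA3` is just an example chosen because the two sets disagree on it.

```python
value = 0xA3  # a value in the character-set-specific range 128-255

# The same value maps to different characters in different character sets.
as_cp1252 = bytes([value]).decode("cp1252")     # pound sign in Windows-1252
as_latin2 = bytes([value]).decode("iso8859_2")  # L with stroke in ISO 8859-2

# Values 0-127 agree across these character sets (plain ASCII).
ascii_same = bytes([0x41]).decode("cp1252") == bytes([0x41]).decode("iso8859_2")

# In a UTF-8 document the character is stored as its Unicode byte sequence,
# not as the single byte given in the Value column.
utf8_bytes = as_cp1252.encode("utf-8")

print(as_cp1252, as_latin2, ascii_same, utf8_bytes.hex(" "))
```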

## Change History

50 changes: 50 additions & 0 deletions content/docs/encoding.md
@@ -0,0 +1,50 @@
---
title: Encoding
weight: 155
---

## Encoding Menu

These entries influence the file encoding of the active file -- how the underlying bytes of the file are interpreted as glyphs, and how the characters you enter are saved as underlying bytes. The [New Document](../preferences/#new-document) preferences will influence which Encoding is selected for a new file, and the [MISC > Autodetect character encoding](../preferences/#misc) preference will affect what encoding will be selected when the file is first read from disk.

These are the major encodings found at the beginning of the **Encoding** menu:
- `ANSI`: A family of encodings based on the active [Windows Code Page](https://en.wikipedia.org/wiki/Windows_code_page) -- most "ANSI" codepages are sets of 256 characters (8-bits); but Windows also allows you to set the "ANSI" codepage to Japanese/Shift-JIS, Simplified Chinese/GBK, Korean Unified Hangul Code, and Traditional Chinese/Big5; and starting in recent Windows 11, also to set the codepage to Unicode UTF-8, which is described more [below](#UseUnicodeUTF8). Whatever code page your OS is set to use (and thus the one that shows up in the **?**-menu's **Debug Info** as `Current ANSI codepage`), that is what the `ANSI` encoding refers to. (It was named generically, because historically, people have thought of their default codepage as the "ANSI" codepage. In the US, that code page is usually Windows-1252, but it depends on your Windows settings.)
- [`UTF-8`](https://en.wikipedia.org/wiki/UTF-8): This encoding uses variable-width multi-byte sequences to represent Unicode characters, either without or with the BOM character at the start of the file. (The BOM isn't technically part of the UTF-8 spec, because there isn't a Little Endian or Big Endian variant of UTF-8 -- the bytes are always in a predefined order. However, many applications use the BOM codepoint to indicate that the file should be interpreted as UTF-8, and Notepad++ supports reading and writing the file with the BOM sequence to support those external applications file-format needs.)
- [`UTF-16`](https://en.wikipedia.org/wiki/UTF-16): This encoding uses two-byte Big Endian or Little Endian sequences to represent Unicode characters.
- The various `Character sets` found in the sub-menus allow you to specify any of the various international sets of characters that provide a limited set of glyphs (most are 8-bits, and thus limited to 256 glyphs), rather than the full suite of Unicode characters. You can use one of these encodings to be able to edit a file from one character set, even if your Windows code page is set to a different code page. For example, this allows you to edit a file using the Eastern European ISO 8859-2 character set even if your copy of Windows is set up for Windows-1252.

The `... with BOM` entries indicate that it uses the Unicode [Byte Order Mark](https://en.wikipedia.org/wiki/Byte_order_mark "BOM") at the start of the file to indicate the correct byte order (big endian or little endian), and in the case of UTF-8, to make it unambiguous that the file is meant to be a UTF-8 Unicode file rather than another 8-bit encoding.
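For reference, the BOM byte sequences themselves can be inspected with Python's standard `codecs` module. This is only a sketch of what the marks look like on disk, not of how Notepad++ handles them.

```python
import codecs

# The BOM is the single code point U+FEFF, serialized per encoding.
bom_utf8 = codecs.BOM_UTF8          # EF BB BF
bom_utf16_le = codecs.BOM_UTF16_LE  # FF FE
bom_utf16_be = codecs.BOM_UTF16_BE  # FE FF

# Byte order only matters for UTF-16: the same letter, two byte orders.
a_le = "A".encode("utf-16-le")  # low byte first
a_be = "A".encode("utf-16-be")  # high byte first

for label, raw in [("UTF-8 BOM", bom_utf8), ("UTF-16 LE BOM", bom_utf16_le),
                   ("UTF-16 BE BOM", bom_utf16_be),
                   ("'A' UTF-16 LE", a_le), ("'A' UTF-16 BE", a_be)]:
    print(f"{label:14} {raw.hex(' ')}")
```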

The `Convert to ...` entries below the separator line will change the encoding (the underlying bytes stored on disk) of the active file, without changing the glyphs. So if you just have the Euro currency symbol `€` in your file, it will be stored as byte 0x80 if you `Convert to ANSI` (and are in a Western-European codepage in Windows), as the three-byte sequence 0xE2 0x82 0xAC if you `Convert to UTF-8`, and as the two-byte sequence 0x20 0xAC if you `Convert to UTF-16 BE BOM`.
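The byte values quoted above can be reproduced with a small Python sketch. It assumes Python's `cp1252` codec as the stand-in for a Western-European "ANSI" code page; the variable names are illustrative only.

```python
import codecs

euro = "€"

ansi_bytes = euro.encode("cp1252")       # one byte in Windows-1252
utf8_bytes = euro.encode("utf-8")        # three bytes
utf16be_bytes = euro.encode("utf-16-be") # two bytes, big endian, no BOM

# A file saved as "UTF-16 BE BOM" would start with the BOM, then the text.
utf16be_file = codecs.BOM_UTF16_BE + utf16be_bytes

for label, raw in [("ANSI (cp1252)", ansi_bytes), ("UTF-8", utf8_bytes),
                   ("UTF-16 BE", utf16be_bytes),
                   ("UTF-16 BE + BOM", utf16be_file)]:
    print(f"{label:16} {raw.hex(' ')}")
```

The glyph is the same in every case; only the stored bytes differ, which is the behavior the `Convert to ...` entries describe.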

The entries above the separator line (without `Convert to` in the name) show the file's active encoding or character set. If you change that setting manually, it will leave the bytes in the file the same and change the glyph or glyph sequence that is shown, based on the updated interpretation of the bytes. For example, if you enter the `€` in a UTF-8 encoded file, and then manually select `Encoding > ANSI`, suddenly that character will look something like `â‚¬` (depending on the active Windows code page); this is because UTF-8 `€` is the three bytes 0xE2 0x82 0xAC, and those three bytes represent three characters when interpreted as ANSI. Or, if you are starting with a character set of **Western European > OEM-US** (the old DOS box-drawing character set) with the `▓` grey box, and you change the character set to **Western European > Windows-1252**, it will become the `²` superscript 2. (Technically, it doesn't always _just_ convert the interpretation: if you start with one of the 2-byte UTF-16 encodings, which store a 0x00 byte as the high-order byte of each ASCII-range character, and you switch the interpretation to **ANSI**, instead of showing all those 0x00 bytes as `NUL` characters, it will just not include those bytes in the new interpretation.)
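Reinterpreting bytes without converting them can be mimicked in Python by encoding with one codec and decoding with another. The sketch below assumes `cp1252` for the Windows-1252 code page and `cp437` for OEM-US; it illustrates the effect only, not Notepad++ internals.

```python
# Bytes written as UTF-8, then read back as if they were Windows-1252.
utf8_bytes = "€".encode("utf-8")           # E2 82 AC
misread = utf8_bytes.decode("cp1252")      # three separate characters

# One byte from the OEM-US (code page 437) set, reread as Windows-1252.
box_byte = "▓".encode("cp437")             # B2
as_cp1252 = box_byte.decode("cp1252")      # superscript two

print(utf8_bytes.hex(" "), "->", misread)
print(box_byte.hex(" "), "->", as_cp1252)
```

In both cases the bytes never change; only the table used to turn them into glyphs does.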

In general, if you want the glyph to stay the same and change the bytes on the disk, then use the `Convert to...` entries; whereas if the glyphs shown don't match what you think the bytes of the data should represent, you probably need to use one of the upper entries to change the interpretation of the bytes.

## Encoding Auto-Detection

If the file you open is encoded in UTF-16 (which always has the byte order mark "BOM" character), or in UTF-8 with the BOM, then Notepad++ will use the encoding based on the BOM.

If the file is an XML or HTML file, then if the encoding is defined in the declaration/prolog, Notepad++ will use that encoding for the file.

Failing that, if [MISC > Autodetect character encoding](../preferences/#misc) is enabled, Notepad++ will also analyze some of the byte sequences in the file, and if they match patterns common to one of the character sets, then Notepad++ will use that encoding.

If it still doesn't have an encoding, then Notepad++ will look to see if it's 100% ASCII (in which case, it chooses "ANSI" or "UTF-8" depending on the [**Apply to opened ANSI files**](../preferences/#new-document) setting); or if all of the non-ASCII bytes follow the rules for valid UTF-8, it will use that encoding.

Finally, if the encoding has not yet been decided (regardless of the autodetection status), Notepad++ will choose the encoding based on the system locale or set it to "ANSI".
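The order of checks described above can be sketched in Python. This is a simplified illustration under stated assumptions, not Notepad++'s actual code: the function name `guess_encoding` and its parameters are hypothetical, the XML/HTML declaration check and the byte-pattern analysis are deliberately left out, and the final fallback is reduced to a single `fallback` argument.

```python
import codecs

def guess_encoding(data: bytes, ascii_as_utf8: bool = False,
                   fallback: str = "ANSI") -> str:
    """Hypothetical sketch of the detection order; not the real implementation."""
    # Step 1: a BOM, when present, decides the encoding.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-BOM"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 LE BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 BE BOM"

    # Steps 2 and 3 (XML/HTML declaration, pattern analysis) are omitted here.

    # Step 4a: pure ASCII; the choice mirrors "Apply to opened ANSI files".
    if data.isascii():
        return "UTF-8" if ascii_as_utf8 else "ANSI"

    # Step 4b: every non-ASCII byte must form valid UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass

    # Step 5: nothing matched, so fall back to the system default.
    return fallback

print(guess_encoding(b"plain text"))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("cp1252")))
```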

If you find that your text with accented characters often gets misinterpreted by Notepad++ (Windows-1255 encoded Hebrew is a common incorrectly-chosen encoding), and if you always or usually just use files that are in the same localization as your Windows is set to, it's generally recommended to turn off character-encoding autodetection, and Notepad++ will be able to use your system setting without incorrectly guessing some other encoding.

## Encoding and Use Unicode UTF-8 for worldwide language support {#UseUnicodeUTF8}

As of Notepad++ version 8.8.8, the **ANSI** and **Convert to ANSI** entries on the **Encoding** menu are disabled when the Windows setting **Use Unicode UTF-8 for worldwide language support** is enabled. When that setting is in effect, the system default code page, which ordinarily defines “ANSI” in Windows, *is* UTF-8; attempting to treat UTF-8 as an ordinary code page does not work properly, which caused erratic behavior prior to version 8.8.8. Since the traditional concept of “ANSI” has no consistent meaning when that Windows setting is enabled, Notepad++ disables `ANSI` encoding. (But even with that OS option set, Notepad++ can still choose one of the Character Set encodings; it just manually selects that entry, not setting it to "ANSI".)

Some Windows 11 installations are coming with that option turned on by default. If you need to be able to use the **Convert to ANSI** action, and you find it's disabled in Notepad++ v8.8.8 or newer (or if that conversion doesn't behave as expected on older versions of Notepad++), you can verify in **?**-menu's **Debug Info**: it will show `Current ANSI codepage: 65001` if that Windows OS option is on. If you want to change that Windows OS setting, Microsoft provides multiple paths to that setting, but two of the common ways to find it are:
1. Windows **Control Panel > Clock & Region** (or just **Region**), go to the **Administrative** tab on the dialog, using the **Change System Locale** button, and toggle the **Use Unicode UTF-8 for worldwide language support** checkmark.
2. Windows **Settings > Time & Language**, in the **Language** (or **Language & Region**) section, find the **Use Unicode UTF-8 for worldwide language support** toggle (it may not be shown, in which case, look under the **Windows Display Language** ▼ pulldown to show it).
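If you prefer to check the active code page from a script rather than from **Debug Info**, a minimal Python sketch follows. It calls the Win32 `GetACP` function through `ctypes` and therefore only returns a value on Windows; the helper names are hypothetical.

```python
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8

def ansi_is_utf8(codepage: int) -> bool:
    """True when the given ANSI code page number is the UTF-8 code page."""
    return codepage == CP_UTF8

def current_ansi_codepage():
    """Return the active ANSI code page on Windows, or None on other systems."""
    if sys.platform != "win32":
        return None
    import ctypes
    return ctypes.windll.kernel32.GetACP()

codepage = current_ansi_codepage()
if codepage is None:
    print("Not running on Windows; no ANSI code page to report.")
else:
    print(f"Current ANSI codepage: {codepage} (UTF-8: {ansi_is_utf8(codepage)})")
```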

## Encoding During Editing

Notepad++ does not always edit a document in the same encoding used to store it in its file. This doesn't affect most users, though some plugins report byte-level information about the internal representation of the document rather than the file itself, which has confused users of those plugins. When the encoding (shown in the **Encoding** menu and in the [status bar](../user-interface/#status-bar)) is ANSI or UTF-8, you are editing the document in the same encoding as the file; in all other cases (UTF-16 or anything from the **Character sets** sub-menus), you are editing the document as UTF-8, and Notepad++ converts from or to the chosen encoding when opening or saving the file.
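The effect on byte-level information can be illustrated with a Python sketch that mimics the load, edit, and save round trip for a UTF-16 LE file. The variable names are illustrative, and the sketch only models the conversion described above, not the editor's real buffer handling.

```python
import codecs

text = "é€"

# What is on disk: BOM followed by UTF-16 LE code units.
file_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Loading: the "utf-16" codec reads the BOM to pick the byte order and drops it.
loaded = file_bytes.decode("utf-16")

# Editing: the in-memory representation is UTF-8, with different byte counts.
internal = loaded.encode("utf-8")

# Saving: convert back to the file's encoding and write the BOM again.
saved = codecs.BOM_UTF16_LE + loaded.encode("utf-16-le")

print("on disk :", file_bytes.hex(" "))
print("internal:", internal.hex(" "))
print("round trip identical:", saved == file_bytes)
```

A plugin that reports offsets in the internal buffer would count 5 bytes here, while the file on disk holds 6.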

Also, for encodings that have the BOM sequence, Notepad++ will _not_ include the BOM character in the editor panel -- but it _is_ in the file on disk during reads and writes; said another way, Notepad++ treats the BOM sequence as "metadata", and doesn't include it in the text you are editing. This means you cannot use the editor panel in Notepad++ to look at, add, or remove the BOM character; if you want to change the BOM status for UTF-8, use the **Convert to UTF-8-BOM** to add the BOM or **Convert to UTF-8** to remove the BOM; if the **Encoding** shows one of the **UTF-16** options chosen, or **UTF-8-BOM**, then you can be confident that Notepad++ will write the BOM when the file saves or read the BOM when the file was loaded.
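Python's `utf-8-sig` codec treats the BOM in a comparable "metadata" fashion, which makes it a convenient way to picture the behavior: the BOM is consumed on read, absent from the text, and written again on save. This is an analogy only, not a description of Notepad++ internals.

```python
import codecs

on_disk = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Reading with "utf-8-sig" strips the BOM, so it never appears in the text.
text = on_disk.decode("utf-8-sig")

# Writing with "utf-8-sig" adds the BOM back (like Convert to UTF-8-BOM);
# writing with plain "utf-8" leaves it out (like Convert to UTF-8).
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

print(repr(text))
print(with_bom.hex(" "))
print(without_bom.hex(" "))
```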