Commit 9a45e02

Merge branch 'develop'
2 parents 351e06f + 57cda42 commit 9a45e02

17 files changed, with 278 additions and 229 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 3 deletions
@@ -110,8 +110,6 @@ jobs:
 compiler: gcc-8, cxxstd: '11,2a', os: ubuntu-latest, container: 'ubuntu:20.04', install: 'g++-8-multilib', address-model: '32,64' }

 # Linux, clang
-- { compiler: clang-3.5, cxxstd: '11', os: ubuntu-latest, container: 'ubuntu:16.04' }
-- { compiler: clang-3.6, cxxstd: '11,14', os: ubuntu-latest, container: 'ubuntu:16.04' }
 - { compiler: clang-3.7, cxxstd: '11,14', os: ubuntu-latest, container: 'ubuntu:16.04' }
 - { compiler: clang-3.8, cxxstd: '11,14', os: ubuntu-latest, container: 'ubuntu:16.04' }
 - { compiler: clang-3.9, cxxstd: '11,14', os: ubuntu-latest, container: 'ubuntu:18.04' }
@@ -142,7 +140,7 @@ jobs:

 # OSX, clang
 - { name: MacOS w/ clang and sanitizers,
-compiler: clang, cxxstd: '11,14,17,20,2b', os: macos-13, ubsan: yes }
+compiler: clang, cxxstd: '11,14,17,20,2b', os: macos-15, ubsan: yes }
 # TODO: Iconv issue
 #- { compiler: clang, cxxstd: '11,14,17,20,2b', os: macos-14 }

doc/building_boost_locale.txt

Lines changed: 2 additions & 1 deletion
@@ -32,7 +32,8 @@
 so you need to either use GNU iconv or link with the ICU library.
 - If the iconv library is not found on Darwin/Mac OS X builds make sure there
 are not multiple iconv installations and provide the -sICONV_PATH build option
-to point to the correct location of the iconv library.
+to point to the correct location of the iconv library.
+Using the HomeBrew installed GNU IConv is highly recommended!

 \subsection bb_building_proc Building Process

doc/main.txt

Lines changed: 7 additions & 0 deletions
@@ -54,6 +54,13 @@ It is based on the operating system native API or on the standard
 C++ library support. Sacrificing some less important features,
 Boost.Locale becomes less powerful but lighter and easier to deploy.

+Charset conversion is also provided through the lightweight IConv library.
+When that is not available ICU or operating system native APIs are used.
+
+\warning The system IConv library on Apple macOS may not be GNU conformant
+and can lead to unexpected results during encoding conversions.
+Using the GNU IConv provided by HomeBrew is highly recommended.
+

 \section main_tutorial Tutorials

doc/rationale.txt

Lines changed: 9 additions & 39 deletions
@@ -37,7 +37,6 @@ Almost every(!) facet has design flaws:

 - \c std::ctype, which is responsible for case conversion, assumes that all conversions can be done on a per-character basis. This is
 probably correct for many languages but it isn't correct in general.
-\n
 -# Case conversion may change a string's length. For example, the German word "grüßen" should be converted to "GRÜSSEN" in upper
 case: the letter "ß" should be converted to "SS", but the \c toupper function works on a single-character basis.
 -# Case conversion is context-sensitive. For example, the Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the Greek letter
@@ -48,20 +47,16 @@ Almost every(!) facet has design flaws:
 - \c std::numpunct and \c std::moneypunct do not specify the code points for digit representation at all,
 so they cannot format numbers with the digits used under Arabic locales. For example,
 the number "103" is expected to be displayed as "١٠٣" in the \c ar_EG locale.
-\n
 \c std::numpunct and \c std::moneypunct assume that the thousands separator is a single character. This is untrue
 for the UTF-8 encoding where only Unicode 0-0x7F range can be represented as a single character. As a result, localized numbers can't be
 represented correctly under locales that use the Unicode "EN SPACE" character for the thousands separator, such as Russian.
-\n
 This actually causes real problems under GCC and SunStudio compilers, where formatting numbers under a Russian locale creates invalid
 UTF-8 sequences.
 - \c std::time_put and \c std::time_get have several flaws:
 -# They assume that the calendar is always Gregorian, by using \c std::tm for time representation, ignoring the fact that in many
 countries dates may be displayed using different calendars.
 -# They always use a global time zone, not allowing specification of the time zone for formatting. The standard \c std::tm doesn't
 even include a timezone field at all.
--# \c std::time_get is not symmetric with \c std::time_put, so you cannot parse dates and times created with \c std::time_put .
-(This issue is addressed in C++11 and some STL implementation like the Apache standard C++ library.)
 - \c std::messages does not provide support for plural forms, making it impossible to correctly localize such simple strings as
 "There are X files in the directory".

@@ -75,13 +70,13 @@ ICU is a very good localization library, but it has several serious flaws:
 - It is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc), instead
 mostly mimicking the Java API.
 - It provides support for only one kind of string, UTF-16, when some users may want other Unicode encodings.
-For example, for XML or HTML processing UTF-8 is much more convenient and UTF-32 easier to use. Also there is no support for
+For example, for XML or HTML processing UTF-8 is much more convenient and UTF-32 easier to use. Also, there is no support for
 "narrow" encodings that are still very popular, such as the ISO-8859 encodings.

 For example: Boost.Locale provides direct integration with \c iostream allowing a more natural way of data formatting. For example:

 \code
-cout << "You have "<<as::currency << 134.45 << " in your account as of "<<as::datetime << std::time(0) << endl;
+cout << "You have "<<as::currency << 134.45 << " in your account as of "<< as::datetime << std::time(0) << endl;
 \endcode

 \section why_icu_wrapper Why an ICU wrapper and not an implementation-from-scratch?
@@ -145,21 +140,16 @@ There are several reasons:
 -# A Gregorian Date by definition can't be used to represent locale-independent dates, because not all
 calendars are Gregorian.
 -# \c ptime -- definitely could be used, but it has several problems:
-\n
 - It is created in GMT or Local time clock, when `time()` gives a representation that is independent of time zones
 (usually GMT time), and only later should it be represented in a time zone that the user requests.
-\n
 The timezone is not a property of time itself, but it is rather a property of time formatting.
-\n
 - \c ptime already defines \c operator<< and \c operator>> for time formatting and parsing.
 - The existing facets for \c ptime formatting and parsing were not designed in a way that the user can override.
 The major formatting and parsing functions are not virtual. This makes it impossible to reimplement the formatting and
 parsing functions of \c ptime unless the developers of the Boost.DateTime library decide to change them.
-\n
 Also, the facets of \c ptime are not "correctly" designed in terms of division of formatting information and
 locale information. Formatting information should be stored within \c std::ios_base and information about
 locale-specific formatting should be stored in the facet itself.
-\n
 The user of the library should not have to create new facets to change simple formatting information like "display only
 the date" or "display both date and time."

@@ -174,30 +164,28 @@ do not actually know how the text should be encoded -- UTF-8, ISO-8859-1, ISO-88
 This may vary between different operating systems and depends on the current installation. So it is critical
 to provide all the required information.
 - ICU fully understands POSIX locales and knows how to treat them correctly.
-- They are native locale names for most operating system APIs (with the exception of Windows)
+- They are native locale names for most operating system APIs (except for Windows)

 \section why_linear_chunks Why do most parts of Boost.Locale work only on linear/contiguous chunks of text?

 There are two reasons:

-- Boost.Locale relies heavily on the third-party APIs like ICU, POSIX or Win32 API, all of them
-work only on linear chunks of text, so providing non-linear API would just hide the
+- Boost.Locale relies heavily on third-party APIs like ICU, POSIX or Win32 API, all of them
+work only on linear chunks of text, so providing a non-linear API would just hide the
 real situation and would hurt performance.
 - In fact, all known libraries that work with Unicode: ICU, Qt, Glib, Win32 API, POSIX API
 and others accept an input as single linear chunks of text and there is a good reason for this:
-\n
 -# Most supported operations on text like collation, case handling usually work on small
 chunks of text. For example: you probably would never want to compare two chapters of a book, but rather
 their titles.
 -# We should remember that even very large texts require quite a small amount of memory, for example
 the entire book "War and Peace" takes only about 3MB of memory.
-\n

 However:

-- There are API's that support stream processing. For example: character set conversion using
+- There are APIs that support stream processing. For example: character set conversion using the
 \c std::codecvt API works on streams of any size without problems.
-- When new API is introduced into Boost.Locale in future, such that it likely works
+- When new API is introduced into Boost.Locale in the future, such that it likely works
 on large chunks of text, will provide an interface for non-linear text handling.


@@ -207,27 +195,9 @@ There are several major reasons:

 - This is how the C++'s \c std::locale class is build. Each feature is represented using a subclass of
 \c std::locale::facet that provides an abstract API for specific operations it works on, see \ref std_locales.
-- This approach allows to switch underlying API without changing the actual application code even in run-time depending
+- This approach allows to switch underlying the API without changing the actual application code even in run-time depending
 on performance and localization requirements.
-- This approach reduces compilation times significantly. This is very important for library that may be
+- This approach reduces compilation times significantly. This is very important for a library that may be
 used in almost every part of specific program.

-\section why_no_special_character_type Why doesn't Boost.Locale provide char16_t/char32_t for non-C++11 platforms?

-There are several reasons:
-
-- C++11 defines \c char16_t and \c char32_t as distinct types, so substituting it with something like \c uint16_t or \c uint32_t
-would not work as for example writing \c uint16_t to \c uint32_t stream would write a number to stream.
-- The C++ locales system would work only if standard facets like \c std::num_put are installed into the
-existing instance of \c std::locale, however in the many standard C++ libraries these facets are specialized for each
-specific character that the standard library supports, so an attempt to create a new facet would
-fail as it is not specialized.
-
-These are exactly the reasons why Boost.Locale fails with current limited C++11 characters support on GCC-4.5 (the second reason)
-and MSVC-2010 (the first reason)
-
-Basically it is impossible to use non-C++ characters with the C++'s locales framework.
-
-The best and the most portable solution is to use the C++'s \c char type and UTF-8 encodings.
-
-*/
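The rationale text touched by this file notes that \c std::ctype converts case one character at a time, so it cannot turn "grüßen" into the longer "GRÜSSEN". A minimal sketch of that limitation using only the standard library; the helper name `upper_per_char` is hypothetical and is not part of Boost.Locale:

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper: upper-case a wide string one character at a time,
// which is the only model the std::ctype facet supports.
std::wstring upper_per_char(std::wstring s, const std::locale& loc = std::locale::classic())
{
    // Each input character maps to exactly one output character,
    // so the result always has the same length as the input.
    std::transform(s.begin(), s.end(), s.begin(),
                   [&loc](wchar_t c) { return std::toupper(c, loc); });
    return s;
}
```

Whatever locale is passed, the result of `upper_per_char(L"gr\u00FC\u00DFen")` has 6 characters, while the correct upper-case form "GRÜSSEN" has 7. Which characters a given locale actually maps is implementation-specific; only the length argument is illustrated here.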

doc/status_of_cpp0x_characters_support.txt

Lines changed: 10 additions & 12 deletions
@@ -7,32 +7,30 @@
 /*!
 \page status_of_cpp0x_characters_support Status of C++11 char16_t/char32_t support

-The support of C++11 \c char16_t and \c char32_t is experimental, mostly does not work, and is not
-intended to be used in production with the latest compilers: GCC-4.5, MSVC10 until major
-compiler flaws are fixed.
+The support of C++11 \c char16_t and \c char32_t is experimental and is not
+intended to be used in production until various compiler/standard library flaws are fixed.

-\section status_of_cpp0x_characters_support_gnu GNU GCC 4.5/C++11 Status
+Many recent C++ compilers provide decent support of C++11 characters, however often:

-GNU C++ compiler provides decent support of C++11 characters however:
-
--# Standard library does not install any std::locale::facets for this support so any attempt
+-# The standard library does not install any std::locale::facets for this support so any attempt
 to format numbers using \c char16_t or \c char32_t streams would just fail.
--# Standard library misses specialization for required \c char16_t/char32_t locale facets,
+-# The standard library misses specialization for required \c char16_t/char32_t locale facets,
 so "std" backends is not build-able as essential symbols missing, also \c codecvt facet
 can't be created as well.

-\section status_of_cpp0x_characters_support_msvc Visual Studio 2010 (MSVC10)/C++11 Status
+\section status_of_cpp0x_characters_support_msvc Visual Studio

-MSVC provides all required facets however:
+MSVC provides all required facets since VS 2010 however:

--# Standard library does not provide installations of std::locale::id for these facets
+-# The standard library does not provide installations of std::locale::id for these facets
 in DLL so it is not usable with \c /MD, \c /MDd compiler flags and requires static link of the runtime
 library.
 -# \c char16_t and \c char32_t are not distinct types but rather aliases of unsigned short and unsigned
 types which contradicts to C++11 requirements making it impossible to write \c char16_t/char32_t to stream
 and causing multiple faults.

-If you want to build or test Boost.Locale with C++11 char16_t and char32_t support you should pass `cxxflags="-DBOOST_LOCALE_ENABLE_CHAR32_T -DBOOST_LOCALE_ENABLE_CHAR16_T"` to `b2` during build and define `BOOST_LOCALE_ENABLE_CHAR32_T` and `BOOST_LOCALE_ENABLE_CHAR32_T` when using Boost.Locale
+If you want to build or test Boost.Locale with C++11 char16_t and char32_t support
+you should pass `define=BOOST_LOCALE_ENABLE_CHAR32_T define=BOOST_LOCALE_ENABLE_CHAR16_T` to `b2` during build and define `BOOST_LOCALE_ENABLE_CHAR32_T` and `BOOST_LOCALE_ENABLE_CHAR32_T` when using Boost.Locale

 */
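The page above objects to implementations where \c char16_t and \c char32_t are aliases of integer types, since C++11 requires them to be distinct types. A small sketch of how a conforming compiler can be checked; the function name `is_character` is an assumption made for this example only:

```cpp
#include <cstdint>
#include <type_traits>

// On a conforming C++11 compiler the character types are distinct from the
// integer types that share their size and representation.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t must be a distinct type");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value,
              "char32_t must be a distinct type");

// Hypothetical overload pair: because the types are distinct, overload
// resolution can tell a character from a number. With a typedef-based
// char16_t these two declarations would collide.
constexpr bool is_character(char16_t) { return true; }
constexpr bool is_character(std::uint_least16_t) { return false; }
```

The same distinction is what lets a stream print `u'a'` as a character rather than as the number 97.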

doc/using_localization_backends.txt

Lines changed: 1 addition & 5 deletions
@@ -94,7 +94,7 @@ problems with this.
 </tr>
 <tr>
 <th>Non UTF-8 encodings</th>
-<td>Yes</td><td>Yes</td><td>No</td><td>Yes</td>
+<td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td>
 </tr>
 <tr>
 <th>Date/Time Formatting/Parsing</th>
@@ -132,10 +132,6 @@ problems with this.
 <th>Unicode Normalization</th>
 <td>Yes</td><td>No</td><td>Vista and above</td><td>No</td>
 </tr>
-<tr>
-<th>C++11 characters</th>
-<td>Yes</td><td>No</td><td>No</td><td>Yes</td>
-</tr>
 <tr>
 <th>OS Support</th>
 <td>Any</td><td>Linux, Mac OS X</td><td>Windows, Cygwin</td><td>Any</td>

include/boost/locale/generic_codecvt.hpp

Lines changed: 14 additions & 17 deletions
@@ -214,9 +214,8 @@ namespace boost { namespace locale {
     // mbstate_t is POD type and should be initialized to 0 (i.a. state = stateT())
     // according to standard. We use it to keep a flag 0/1 for surrogate pair writing
     //
-    // if 0/false no codepoint above >0xFFFF observed, else a codepoint above 0xFFFF was observed
-    // and first pair is written, but no input consumed
-    bool state = *reinterpret_cast<char*>(&std_state) != 0;
+    // If true then only the high surrogate of a codepoint > 0xFFFF was written, but no input consumed.
+    bool low_surrogate_pending = *reinterpret_cast<char*>(&std_state) != 0;
     auto cvt_state = implementation().initial_state(to_unicode_state);
     while(to < to_end && from < from_end) {
         const char* from_saved = from;
@@ -237,31 +236,29 @@ namespace boost { namespace locale {
         if(ch <= 0xFFFF)
             *to++ = static_cast<uchar>(ch);
         else {
-            // For other codepoints we do the following
+            // For other codepoints we can't consume our input as we may find ourselves in a state
+            // where all input is consumed but not all output written, i.e. only the high surrogate is written.
             //
-            // 1. We can't consume our input as we may find ourselves
-            // in state where all input consumed but not all output written,i.e. only
-            // 1st pair is written
-            // 2. We only write first pair and mark this in the state, we also revert back
-            // the from pointer in order to make sure this codepoint would be read
-            // once again and then we would consume our input together with writing
-            // second surrogate pair
+            // So we write only the high surrogate and mark this in the state.
+            // We also set the from pointer to the previous position, i.e. don't consume the input, so this
+            // codepoint will be read again and then we will consume our input together with writing the low
+            // surrogate.
             ch -= 0x10000;
-            std::uint16_t w1 = static_cast<std::uint16_t>(0xD800 | (ch >> 10));
-            std::uint16_t w2 = static_cast<std::uint16_t>(0xDC00 | (ch & 0x3FF));
-            if(!state) {
+            const std::uint16_t w1 = static_cast<std::uint16_t>(0xD800 | (ch >> 10));
+            const std::uint16_t w2 = static_cast<std::uint16_t>(0xDC00 | (ch & 0x3FF));
+            if(!low_surrogate_pending) {
                 from = from_saved;
                 *to++ = w1;
             } else
                 *to++ = w2;
-            state = !state;
+            low_surrogate_pending = !low_surrogate_pending;
         }
     }
     from_next = from;
     to_next = to;
-    if(r == std::codecvt_base::ok && (from != from_end || state))
+    if(r == std::codecvt_base::ok && (from != from_end || low_surrogate_pending))
         r = std::codecvt_base::partial;
-    *reinterpret_cast<char*>(&std_state) = state;
+    *reinterpret_cast<char*>(&std_state) = low_surrogate_pending;
     return r;
 }
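The comments changed in this file describe writing a code point above U+FFFF in two steps: the high surrogate first without consuming the input, then the low surrogate together with the input. The following is a standalone sketch of that idea, not the library's code. The names `to_surrogates`, `step` and `emit_unit` are hypothetical; only the arithmetic and the flag name `low_surrogate_pending` come from the diff.

```cpp
#include <cstdint>
#include <utility>

// Split a code point in [0x10000, 0x10FFFF] into a (high, low) surrogate pair,
// using the same arithmetic as the hunk above.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    cp -= 0x10000;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 | (cp >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF));
    return {high, low};
}

// Result of one output step: the UTF-16 unit written and whether the
// input code point may now be consumed.
struct step {
    std::uint16_t unit;
    bool consumed;
};

// Emit exactly one UTF-16 unit per call. For code points above 0xFFFF the
// first call writes the high surrogate and leaves the input unconsumed;
// the second call writes the low surrogate and consumes it.
step emit_unit(std::uint32_t cp, bool& low_surrogate_pending)
{
    if(cp <= 0xFFFF)
        return {static_cast<std::uint16_t>(cp), true};
    const std::pair<std::uint16_t, std::uint16_t> p = to_surrogates(cp);
    const step s = low_surrogate_pending ? step{p.second, true} : step{p.first, false};
    low_surrogate_pending = !low_surrogate_pending;
    return s;
}
```

Calling `emit_unit(0x1F600, pending)` twice with `pending` initially `false` yields `0xD83D` (not consumed) and then `0xDE00` (consumed), leaving `pending` at `false` again. Keeping the flag outside the function mirrors how the real code stores it in the `mbstate_t` argument between calls.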

src/std/codecvt.cpp

Lines changed: 4 additions & 1 deletion
@@ -18,11 +18,14 @@ namespace boost { namespace locale { namespace impl_std {
 std::locale
 create_codecvt(const std::locale& in, const std::string& locale_name, char_facet_t type, utf8_support utf)
 {
-#if defined(BOOST_WINDOWS)
+#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
     // This isn't fully correct:
     // It will treat the 2-Byte wchar_t as UTF-16 encoded while it may be UCS-2
     // std::basic_filebuf explicitely disallows using suche multi-byte codecvts
     // but it works in practice so far, so use it instead of failing for codepoints above U+FFFF
+    //
+    // Additionally, the stdlib in Cygwin has issues converting long UTF-8 sequences likely due to left-over
+    // state across buffer boundaries. E.g. the low surrogate after a sequence of 255 UTF-16 pairs gets corrupted.
     if(utf != utf8_support::none)
         return util::create_utf8_codecvt(in, type);
 #endif
