-
-
Notifications
You must be signed in to change notification settings - Fork 585
Description
Issue Description
When migrating data from SQL Server to PostgreSQL using pgloader, Arabic text stored in NVARCHAR and VARCHAR columns is being converted to question marks (???) instead of preserving the original Arabic characters.
Expected Behavior
Arabic text should be migrated correctly and displayed as proper Arabic characters in PostgreSQL.
Actual Behavior
Arabic text appears as question marks (???) in PostgreSQL after migration
Hex analysis shows the data is stored as literal question mark characters (3f3f3f3f3f)
New UTF-8 Arabic text inserted directly into PostgreSQL works correctly
Source Data Details
SQL Server Configuration:
Arabic text is stored in NVARCHAR columns (UTF-16 encoding)
Some columns may use VARCHAR with Arabic Windows collation (Windows-1256)
Data displays correctly in SQL Server Management Studio
Sample SQL Server Data:
sql-- Example of data that should migrate
SELECT StatusNameSL FROM ClaStatus WHERE StatusNameSL LIKE N'%Arabic_Text%'
-- Shows proper Arabic characters in SSMS
Migration Configuration Attempted
Configuration 1: Basic Setup
lispLOAD DATABASE
FROM mssql://user:password@server/source_database
INTO postgresql://user:password@localhost/target_database
WITH include drop, create tables, create indexes, reset sequences
CAST type nvarchar to text,
type varchar to text
ALTER SCHEMA 'dbo' RENAME TO 'public';
Configuration 2: With UTF-8 Decoding
lispLOAD DATABASE
FROM mssql://user:password@server/source_database
INTO postgresql://user:password@localhost/target_database
WITH include drop, create tables, create indexes, reset sequences,
decoding-as utf8
CAST type nvarchar to text,
type varchar to text
ALTER SCHEMA 'dbo' RENAME TO 'public';
Configuration 3: With Encoding Specifications
lispLOAD DATABASE
FROM mssql://user:password@server/source_database?charset=utf8
INTO postgresql://user:password@localhost/target_database
WITH include drop, create tables, create indexes, reset sequences
SET client_encoding to 'UTF8'
CAST type nvarchar to text using utf8-decode,
type varchar to text using utf8-decode
BEFORE LOAD DO
$$ SET client_encoding = 'UTF8'; $$;
Command Line Attempts
bashpgloader --encoding utf8 config.load
pgloader --with "decoding-as utf8" config.load
Investigation Results
PostgreSQL Analysis
After migration, analyzing the data shows:
sql-- Check encoding
SHOW server_encoding; -- UTF8
SHOW client_encoding; -- UTF8
-- Hex analysis of migrated data
SELECT encode(column_name::bytea, 'hex') FROM table_name;
-- Result: 3f3f3f3f3f (question marks in hex)
-- Test with manually inserted Arabic text
INSERT INTO table_name (column_name) VALUES ('النص العربي');
-- This works correctly and shows proper hex values like: d8a7d984d984d987
Additional Context
The target PostgreSQL database is configured with UTF-8 encoding
Manual insertion of Arabic text into PostgreSQL works perfectly
The issue appears to be during the data extraction/conversion phase from SQL Server
Other non-ASCII characters (if any) may also be affected
Possible Root Cause
The issue seems to occur when pgloader reads NVARCHAR data from SQL Server. The UTF-16 encoded Arabic text might not be properly converted to UTF-8 during the migration process, resulting in the database receiving question mark characters instead of the original Arabic text.
Request
Could you please:
Confirm if this is a known issue with MSSQL migrations
Provide guidance on proper configuration for Arabic/Unicode text migration
Consider adding better Unicode support documentation for MSSQL migrations
Sample Data for Testing
If needed, I can provide a minimal SQL Server database with Arabic text samples for testing purposes.
Additional Information:
Willing to test proposed solutions
Can provide more detailed logs if needed
Available to help with debugging this encoding issue