Skip to content

Arabic Text Converted to Question Marks During SQL Server to PostgreSQL Migration #1675

@MuhammadJamal99

Description

@MuhammadJamal99

Issue Description
When migrating data from SQL Server to PostgreSQL using pgloader, Arabic text stored in NVARCHAR and VARCHAR columns is being converted to question marks (???) instead of preserving the original Arabic characters.
Expected Behavior
Arabic text should be migrated correctly and displayed as proper Arabic characters in PostgreSQL.
Actual Behavior

Arabic text appears as question marks (???) in PostgreSQL after migration
Hex analysis shows the data is stored as literal question mark characters (3f3f3f3f3f)
New UTF-8 Arabic text inserted directly into PostgreSQL works correctly

Source Data Details
SQL Server Configuration:

Arabic text is stored in NVARCHAR columns (UTF-16 encoding)
Some columns may use VARCHAR with Arabic Windows collation (Windows-1256)
Data displays correctly in SQL Server Management Studio
Sample SQL Server Data:
sql-- Example of data that should migrate
SELECT StatusNameSL FROM ClaStatus WHERE StatusNameSL LIKE N'%Arabic_Text%'
-- Shows proper Arabic characters in SSMS
Migration Configuration Attempted
Configuration 1: Basic Setup
lispLOAD DATABASE
FROM mssql://user:password@server/source_database
INTO postgresql://user:password@localhost/target_database

WITH include drop, create tables, create indexes, reset sequences

CAST type nvarchar to text,
type varchar to text

ALTER SCHEMA 'dbo' RENAME TO 'public';
Configuration 2: With UTF-8 Decoding
lispLOAD DATABASE
FROM mssql://user:password@server/source_database
INTO postgresql://user:password@localhost/target_database

WITH include drop, create tables, create indexes, reset sequences,
decoding-as utf8

CAST type nvarchar to text,
type varchar to text

ALTER SCHEMA 'dbo' RENAME TO 'public';
Configuration 3: With Encoding Specifications
lispLOAD DATABASE
FROM mssql://user:password@server/source_database?charset=utf8
INTO postgresql://user:password@localhost/target_database

WITH include drop, create tables, create indexes, reset sequences

SET client_encoding to 'UTF8'

CAST type nvarchar to text using utf8-decode,
type varchar to text using utf8-decode

BEFORE LOAD DO
$$ SET client_encoding = 'UTF8'; $$;
Command Line Attempts
bashpgloader --encoding utf8 config.load
pgloader --with "decoding-as utf8" config.load
Investigation Results
PostgreSQL Analysis
After migration, analyzing the data shows:
sql-- Check encoding
SHOW server_encoding; -- UTF8
SHOW client_encoding; -- UTF8

-- Hex analysis of migrated data
SELECT encode(column_name::bytea, 'hex') FROM table_name;
-- Result: 3f3f3f3f3f (question marks in hex)

-- Test with manually inserted Arabic text
INSERT INTO table_name (column_name) VALUES ('النص العربي');
-- This works correctly and shows proper hex values like: d8a7d984d984d987
Additional Context

The target PostgreSQL database is configured with UTF-8 encoding
Manual insertion of Arabic text into PostgreSQL works perfectly
The issue appears to be during the data extraction/conversion phase from SQL Server
Other non-ASCII characters (if any) may also be affected
Possible Root Cause
The issue seems to occur when pgloader reads NVARCHAR data from SQL Server. The UTF-16 encoded Arabic text might not be properly converted to UTF-8 during the migration process, resulting in the database receiving question mark characters instead of the original Arabic text.
Request
Could you please:

Confirm if this is a known issue with MSSQL migrations
Provide guidance on proper configuration for Arabic/Unicode text migration
Consider adding better Unicode support documentation for MSSQL migrations

Sample Data for Testing
If needed, I can provide a minimal SQL Server database with Arabic text samples for testing purposes.

Additional Information:

Willing to test proposed solutions
Can provide more detailed logs if needed
Available to help with debugging this encoding issue

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions