I need a Char Encoding Trick to Strip Hebrew Accent Marks.
בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ
בראשית ברא אלהים את השמים ואת הארץ
The trick here is to realize that these characters that you see in the question with the “accents” aren’t really the characters (i.e. “These aren’t the
droidscharacters you are looking for” 😉 ). The “accents” are various types of notations indicating things like:
vowels (lines and dots that are typically under the letters):
base letter “ה” = “h”; “הֶ” = “heh” and “הָ” = “hah”
pronunciation (dots that are usually inside or above letters):
“בּ” = “b” vs “ב” = “v”, or “שׂ” = “s” vs “שׁ” = “sh”
- cantillation (how it should be sung)
The actual Hebrew letters are what is shown in the stripped down version (i.e. the end result of what is being requested here). What we are referring to here as “accents” are known as diacritical marks. The Wikipedia article on Hebrew diacritics has a lot of good information about these marks, including the following image and caption:
Gen. 1:9 And God said, “Let the waters be collected”.
Letters in black, pointing in red, cantillation in blue
Getting from those base characters to what the first line (with the vowels, etc) shows is a matter of adding one or more “accents”. Unicode (UTF-16 in SQL Server, though default interpretation only handles the UCS-2 / Basic Multilingual Plane (BMP) code points) allows for some characters to overlay another non-overlay character when adjacent to them. These are known as Combining Characters.
SELECT DATALENGTH(N'מַ֖'); -- character taken from original given text
2 as most people would expect from seeing a single, double-byte character. So maybe we try to find what character is there by doing:
Of course, the
ASCII functions only return the
INT value of the first character of whatever string they are given. But a value of 1502 only covers 2 bytes, which leaves 4 bytes unaccounted for. Looking at the binary/hex values of that same Hebrew “character”:
SELECT NCHAR(1502), CONVERT(BINARY(2), UNICODE(N'מַ֖')), CONVERT(VARBINARY(10), N'מַ֖');
מ 0x05DE 0xDE05B7059605
Now, 0x05DE is the hex representation of 1502, and the 1502 is only the “מ“. The next part can be separated into three 2-byte sets: DE05 B705 9605. Now, Unicode string values are stored in Little Endian, which means the byte-order is reversed. If we switch each of those three sets we get:
05DE (the base character) 05B7 0596 (the unaccounted for 4 bytes).
Ok. So what happens if we remove that base character?
SELECT REPLACE(N'מַ֖' COLLATE Hebrew_BIN2, NCHAR(1502) COLLATE Hebrew_BIN2, '');
That returns the two remaining characters (not easy to see here so I have made the following line a header in order to increase the font size; you can also run the above
REPLACE to see them):
Removing the מ from the מַ֖ leaves two characters at the bottom: ַ֖
Hence, we need to strip out each individual code-point that is one of these “extra” combining characters (found at: http://unicode-table.com/en/search/?q=hebrew) and that will leave us with the base characters. We can do that via:
CREATE FUNCTION dbo.RemoveHebrewAccents (@txeTwerbeH NVARCHAR(MAX)) RETURNS NVARCHAR(MAX) WITH SCHEMABINDING AS BEGIN WITH base (dummy) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ), nums AS ( -- we will want to generate code points 1425 - 1479 SELECT TOP (55) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [Num] FROM base b1 CROSS JOIN base b2 ) SELECT @txeTwerbeH = REPLACE( @txeTwerbeH COLLATE Hebrew_BIN2, NCHAR(1424 + nums.[Num]) COLLATE Hebrew_BIN2, '' ) FROM nums; RETURN @txeTwerbeH; END;
And then we can test it with the original text as follows:
DECLARE @Hebrew NVARCHAR(200) = N'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ'; SELECT dbo.RemoveHebrewAccents(@Hebrew);
בראשית ברא אלהים את השמים ואת הארץ
Technically, there is a set of code-points between 64298 and 64334 that do have some vowels and pronunciation “accents” built into the character. If those need to be handled, that can be a second step in the function to do a simple replacement of those characters.
It seems that these accent, punctuation, etc code-points only match when using a binary collation. Even using
Hebrew_100_CS_AS_KS_WS_SCdid not match them. But the following did work:
Latin1_General_BIN2. In the function I ended up using
Hebrew_BIN2. Please note that when using binary collations, unless you have a specific need to use the older
_BINcollations, you should only be using the newer
For anyone who is curious, the Hebrew sample text is actually Bereishis 1:1 (that is also the first word on the right-side as Hebrew is read right-to-left; in English it would be “Genesis 1:1” though that is not a direct translation of the word, just the name of the first book of the Torah / Bible; the direct translation is “in the beginning”):
In the beginning of God’s creating the heavens and the Earth
2015-01-19: I found some great resources which explain both Combining Characters and the Hebrew character set:
- http://en.wikipedia.org/wiki/Combining_character (also linked towards the top of this answer)
- http://people.w3.org/rishida/scripts/pickers/hebrew/ (play with all of the characters in the Hebrew character set; very cool!)
This is an interesting problem, and one I sort of faced a while back working with Japanese characters. I hit a bit of a brick wall trying to locate your problem characters, though I hope this gets you going somewhere with finding them.
First I got all NCHARs into a table:
SET NOCOUNT ON DECLARE @cnt INT = 1 DECLARE @sqlcmd NVARCHAR(512) = '' CREATE TABLE #CHARS ( [CharOrder] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, [Result] NVARCHAR(4) ) WHILE @cnt < 65536 BEGIN SELECT @sqlcmd = ' INSERT #CHARS ([Result] ) SELECT NCHAR(' + CAST(@cnt AS NVARCHAR) + ') ' EXEC sys.sp_executesql @sqlcmd SET @cnt +=1 END
Then I located one of the non-accented chars:
SELECT c.CharOrder, c.Result FROM #CHARS AS c WHERE c.Result = N'ר' ORDER BY c.CharOrder
Then I located the range of chars that Hebrew characters are in:
SELECT c.CharOrder, c.Result FROM #CHARS AS c WHERE c.CharOrder >= 1488 AND c.CharOrder < 1523 ORDER BY c.CharOrder
But trying to find the accented chars you want, they don’t seem to show up, except one hit at code 8501.
SELECT c.CharOrder , c.Result FROM #CHARS AS c WHERE c.Result IN ( N'רֵ', N'א', N'שִׁ֖', N'י', N'ת', N'בְּ', N'בָּ', N'רָ֣', N'א', N'אֱ', N'לֹ', N'הִ֑', N'י', N'ם', N'אֵ֥', N'ת', N'הַ', N'שָּׁ', N'מַ֖', N'יִ', N'ם', N'וְ', N'אֵ֥', N'ת', N'הָ', N'אָֽ', N'רֶ', N'ץ' ) ORDER BY c.CharOrder
So just looking at the surrounding characters, I can’t really identify any other matches to your text.
SELECT c.CharOrder, c.Result FROM #CHARS AS c WHERE c.CharOrder >= 8499 AND c.CharOrder < 8539 ORDER BY c.CharOrder
A lot of them seem to get thrown as those nebulous little rectangles of whatever.
Again, sorry it’s not a solution, but hope it helps.
I’ve used a Numbers table. There are any number of posts explaining what this is, why its useful and how to get one efficiently.
I don’t use any built-in functionality to convert accented characters to the non-accented equivalent. Instead I build a lookup list which you will populate with the conversions you require. You will have to use
nvarchar and define your translations as
N'x', of course.
Thanks to this post for the row concatenation tip.
drop table #Numbers; select * into #Numbers from ( select * from (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11)) as T(N) ) as xx; drop table #Lookups; select * into #Lookups from ( select * from (values ('a','m'),('b','n'),('c','o'),('d','p'),('e','q'),('m','z')) as T(CharFrom,CharTo) ) as xx; drop table #Inputs; select * into #Inputs from ( select * from (values ('abcdefghi') ,('abtcd') ) as T(Word) ) as xx; select ix.Word as Original ,( select Coalesce(l.CharTo, SUBSTRING(i.word, n.N, 1)) -- do not alias from #Inputs as i cross apply #Numbers as n left join #Lookups as l on l.CharFrom = SUBSTRING(i.word, n.N, 1) where n.N <= LEN(i.Word) and i.Word = ix.Word for xml path ('') ) as Substituted from #Inputs as ix;
Here is what worked if anyone in the future wants.
$search = array("֑", "֒", "֓", "֔", "֕",
"֖", "֗", "֘", "֙", "֚", "֛", "֜",
"֝", "֞", "֟", "֠", "֡", "֢", "֣",
"֤", "֥", "֦", "֧", "֨", "֩", "֪",
"֫", "֬", "֭", "֮", "֯", "ְ", "ֱ",
"ֲ", "ֳ", "ִ", "ֵ", "ֶ", "ַ", "ָ",
"ֹ", "ֺ", "ֻ", "ּ", "ֽ", "־", "ֿ",
"׀", "ׁ", "ׂ", "׃", "ׄ", "ׅ", "׆", "ׇ");
$replace = "";
$cleanHebrew = str_replace($search, $replace, $accentHebrew);