
Database Migration - Character Encoding Issues

Posted: Thu Dec 05, 2019 5:25 pm
by isunktheship
tl;dr: it looks like the character encoding wasn't migrated properly behind the scenes, and this is garbling text throughout the app.


Hey all, I've looked through many posts to see if this issue has come up. I found one from back in 2012, but it's likely unrelated to the changes the site is currently going through.
(Relevant post: viewtopic.php?f=8&t=102369&p=1723172&hi ... r#p1723172)

There are loads of examples of this. As a developer myself, I know the issue lies with character encoding.

Examples:
"Studio Ghibli Kōkyō Kyokushū 13 Stout - Howl's Moving Castle"
https://expressobeans.com/public/detail.php/175065

It’s the Easter Beagle, Charlie Brown! 13 Whalen - Standard
https://expressobeans.com/public/detail.php/167375

Can’t Stop Me Now 19 Zhang - 1st
https://expressobeans.com/public/detail.php/275047

April Fools’ Special 19 STOT21stCplanB - 1st
https://expressobeans.com/public/detail.php/275486

この星のいきもの達の風景(Landscape of this Planet's - 1st
https://expressobeans.com/public/detail.php/263897

Terrible Dream (恐ろしい夢) 16 Goto - 1st
https://expressobeans.com/public/detail.php/235718

Inversion and Overlap (反転と重複) 17 Shimoda - 1st
https://expressobeans.com/public/detail.php/252761

Details:

I know for a fact that the title of the Tom Whalen Easter Beagle print should contain a single quotation mark, but there are several single-quote characters:
APOSTROPHE: '
Unicode Character 'LEFT SINGLE QUOTATION MARK' (U+2018): ‘
Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019): ’
(As I'm writing this you'll notice you can see the newly supported characters:)
Expected Result: "It’s the Easter Beagle, Charlie Brown! 13 Whalen - Standard"
Actual Result: "It’s the Easter Beagle, Charlie Brown! 13 Whalen - Standard"
If you plug those two values into the website listed below, you'll see which encoding mix-ups could have happened along the way:
http://string-functions.com/encodingerror.aspx
Displaying 5 results
utf-8 (65001, Unicode (UTF-8)) -> windows-1250 (1250, Central European (Windows))
utf-8 (65001, Unicode (UTF-8)) -> Windows-1252 (1252, Western European (Windows))
utf-8 (65001, Unicode (UTF-8)) -> windows-1254 (1254, Turkish (Windows))
utf-8 (65001, Unicode (UTF-8)) -> windows-1256 (1256, Arabic (Windows))
utf-8 (65001, Unicode (UTF-8)) -> windows-1258 (1258, Vietnamese (Windows))
These 5 encodings have roughly the same charmaps (all of them turn the three UTF-8 bytes of ’, E2 80 99, into ’), so they're interchangeable for this example.
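
For anyone who wants to reproduce this without the website, here is a minimal Python sketch of what I think is happening. I'm assuming the title was stored as UTF-8 and then read back as one of the Windows code pages above. The variable names are my own, and I have no idea what the site's actual stack looks like.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK is the character the title should contain.
expected = "It\u2019s the Easter Beagle, Charlie Brown! 13 Whalen - Standard"

# Python codec names for the five code pages the website listed.
CODE_PAGES = ["cp1250", "cp1252", "cp1254", "cp1256", "cp1258"]

# The title as UTF-8 bytes; the quote alone becomes three bytes: e2 80 99.
utf8_bytes = expected.encode("utf-8")
quote_bytes = "\u2019".encode("utf-8")

# Misread those UTF-8 bytes as each single-byte code page.
misread = {cp: utf8_bytes.decode(cp) for cp in CODE_PAGES}

if __name__ == "__main__":
    print("quote bytes:", " ".join(f"{b:02x}" for b in quote_bytes))
    for cp, text in misread.items():
        print(cp, "->", text)
```

All five code pages give the same garbled title here, which matches the "Actual Result" above.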

---

Solution:

So that leaves us with the main question: do admins understand this issue, and is there a plan to fix it?
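
If the damage really is a single UTF-8 to Windows-1252 misread, it should be reversible. Below is a rough Python sketch of the idea. `repair_mojibake` is a hypothetical helper of mine, not anything from the site's code, and it assumes the text was garbled exactly once.

```python
def repair_mojibake(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo one UTF-8 -> wrong_codec misread.

    Returns the text unchanged when the reversal does not apply,
    e.g. the text is already correct or holds characters the code
    page cannot represent (such as Japanese).
    """
    try:
        # Re-encode with the wrong codec to recover the original bytes,
        # then decode those bytes as the UTF-8 they were all along.
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    broken = "It\u00e2\u20ac\u2122s the Easter Beagle, Charlie Brown!"
    print(repair_mojibake(broken))
```

This is only safe to run over rows that are actually broken. A correct string that happens to look like mojibake (say, a literal "Ã©") would be "repaired" into something else, so any real fix would need a check or a backup first.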

Edit: Relevant post: https://devblog.songkick.com/the-great- ... 73f2ec631d