If you see an error like the following on the Settings page of Smart Insight for the contact load (contacts_extract.csv):
Unable to load the file due to the following errors:
PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0x00 (seg16 slice1 gp1-sdw3:40004 pid=31757) (Sequel::DatabaseError)
it means that somewhere in the contacts file, in one of the fields we want to load to SI, there is a byte that cannot be loaded as UTF8 (in this example a NUL byte, 0x00).
How to find the problematic entry?
1. Download the corrupted contacts.csv file from the etl01 server with WinSCP (from the /smart-insight-data/customername folder; this is the folder where corrupted files that have not been loaded yet can be found).
3. Find the entry '00' (the 0x00 byte named in the error message).
4. Identify the contact that contains the '00' entry.
5. Fix the value in suite DB or on the UI, or inform TCS if you don't want to fix it yourself.
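The search for the 0x00 byte can also be done on the command line. The following is only a minimal sketch, assuming GNU grep built with PCRE support (the -P option); find_nul_lines is a hypothetical helper name, not an existing script on the server.

```shell
# Hypothetical helper: print "line number:content" for every line of a file
# that contains a NUL byte (0x00).
find_nul_lines() {
    # -n: prefix each match with its line number
    # -a: treat the file as text even though it contains binary bytes
    # -P: Perl-compatible regex, where \x00 matches a NUL byte
    # tr replaces each NUL with '?' so its position is visible in the terminal
    grep -naP '\x00' "$1" | tr '\000' '?'
}
```

Usage: find_nul_lines contacts.csv
Each printed line starts with the line number in the file, which helps to identify the affected contact.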
Other related, useful scripts for troubleshooting such errors:
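iconv -f WINDOWS-1252 -t UTF-8 filename #Convert file from ANSI to UTF-8 ("ANSI" is not an iconv encoding name; WINDOWS-1252 is assumed here, check the available names with iconv -l)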
iconv -f US-ASCII -t UTF-8 filename #Convert file from US-ASCII to UTF-8
iconv -f ISO-8859-1 -t utf-8 file1.csv > file2.csv #Convert file1.csv from ISO-8859-1 to UTF-8 and write the result to file2.csv
grep -axvn '.*' file1.csv > file2.csv #Find the lines (with line numbers) that contain bytes which are not valid UTF-8; needs a UTF-8 locale
grep -aP "\0" file1.csv > file2.csv #Find the lines that contain hidden NUL bytes (0x00; editors show them as ^@)
grep -rlI $'\xEF\xBB\xBF' . #List the files under the current folder that contain a hidden UTF-8 BOM
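
Note that a NUL byte (0x00) is itself valid UTF-8, so a plain UTF-8 validity check does not catch the error shown at the top; PostgreSQL rejects it anyway. The sketch below runs both checks on one file. It is only an illustration, assuming GNU iconv and GNU tr; check_file is a hypothetical helper name.

```shell
# Hypothetical helper: report whether a file is valid UTF-8 and how many
# NUL bytes (0x00) it contains.
check_file() {
    file="$1"
    # iconv exits with a non-zero status at the first byte sequence that is
    # not valid UTF-8; the converted output itself is discarded.
    if iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
        utf8=valid
    else
        utf8=invalid
    fi
    # Delete every byte except NUL, then count what is left.
    nul_count=$(tr -cd '\000' < "$file" | wc -c | tr -d ' ')
    echo "utf8=$utf8 nul_bytes=$nul_count"
}
```

Usage: check_file contacts.csv
A file that fails with the 0x00 error above would be expected to show a nul_bytes value greater than 0, even when utf8=valid.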