Created: January 25th 2013
Last updated: May 1st 2020
Categories: Common Web Development,  Linux
Author: Marcus Fleuti

Linux Bash shell script for recursively converting all files in a directory from various charsets into UTF-8

Tags:  bash,  charset,  Linux,  recursive,  script,  UTF-8

A Linux Bash shell script that recursively converts files saved in any charset into UTF-8.

LEXO created a script that converts the charset of all files within a directory (and its subdirectories) into UTF-8.

How it works

  1. The script first builds a list of all files in the given directory and its subdirectories.
  2. It then iterates over this list.
  3. For each file, it determines the current charset and the file extension.
  4. If the file is not already UTF-8 and its extension is listed in the filestoconvert array (see usage below), the file is converted using the iconv Linux tool.
  5. If iconv returns an error during conversion (this sometimes happens), the affected file is listed once the whole run has finished. Check each listed file manually. In our tests the files reported as "could not convert" turned out to be UTF-8 after all, so those errors were false alarms.
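
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the script itself: the helper names get_extension, is_convertible and detect_charset are hypothetical, and detect_charset assumes a file(1) implementation that supports the --mime-encoding option, as the one shipped with common Linux distributions does.

```shell
#!/bin/bash
# Hypothetical helpers illustrating steps 3 and 4; not part of convert.sh.

# Extensions that may be converted (same idea as the filestoconvert array).
convertible="htm html php txt tpl asp css js"

# Print the text after the last dot of the file name.
# Note: a name without any dot is printed unchanged.
get_extension() {
    local filename
    filename=$(basename "$1")
    printf '%s\n' "${filename##*.}"
}

# Succeed (exit status 0) if the extension of $1 is in the list above.
is_convertible() {
    local ext candidate
    ext=$(get_extension "$1")
    for candidate in $convertible; do
        [ "$candidate" = "$ext" ] && return 0
    done
    return 1
}

# Print the charset that file(1) guesses for $1, e.g. "iso-8859-1" or "utf-8".
detect_charset() {
    file -b --mime-encoding "$1"
}
```

A call such as `is_convertible page.php && detect_charset page.php` would then print the guessed charset only for files that are candidates for conversion.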

Usage

  1. First, edit the script and check the variable filestoconvert. It holds a list of file extensions such as "htm" or "php" (without the leading dot). The script only converts files whose extension is listed in this array, because binary files such as images or PDF documents must not be converted into UTF-8. Take a close look at this array and add or remove file extensions as needed.
  2. The script is a Bash script (it uses Bash arrays), so run it with Bash rather than plain sh:
    1. First make the script executable using chmod:
      chmod 0755 ./convert.sh
    2. Then run it from your shell (for example over SSH):
      ./convert.sh /path/to/files/you/want/to/convert
  3. Before you run the script, make a backup of all your files! Conversions like this can always break something.
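
One possible way to make that backup is a compressed tar archive of the whole directory tree. The sketch below assumes tar with gzip support is installed; the function name backup_tree and the archive path are only examples and do not appear in the script.

```shell
#!/bin/bash
# Hypothetical helper: archive a directory tree before converting it.

# Usage: backup_tree /path/to/files /path/to/backup.tar.gz
backup_tree() {
    local source_dir=$1
    local archive=$2
    # -C changes into the directory first, so the archive stores relative paths.
    tar -czf "$archive" -C "$source_dir" .
}
```

For example, `backup_tree /path/to/files/you/want/to/convert "$HOME/files-backup.tar.gz"` writes the archive into the home directory. Keep the archive outside the directory being archived, otherwise tar would try to include it in itself. The files can later be restored with `tar -xzf "$HOME/files-backup.tar.gz" -C /path/to/restore/into`.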

User feedback and colors

The script prints detailed output while it runs. Every message type has its own color:

  1. File successfully converted (green)
    Displays the filename, its original charset and the fact that it has been converted to UTF-8
  2. File does not need conversion (blue)
    Displays the filename and the information that it is already in UTF-8 format
  3. File skipped (white)
    The file was not converted because its extension is not listed in the filestoconvert variable/array
  4. File conversion error (red)
    All files that could not be converted are displayed in red text at the end of the process
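
The colors are plain ANSI escape sequences. The following sketch shows how one such status line could be produced with printf. The function name print_status is hypothetical; the color codes are the ones defined in the script.

```shell
#!/bin/bash
# Hypothetical helper showing how a colored status line can be built.

reset='\033[0;00m'
success='\033[1;32m'          # green: converted
noconversion='\033[1;34m'     # blue: already UTF-8
fileskipped='\033[1;37m'      # white: extension not listed
fileconverterror='\033[1;31m' # red: conversion failed

# Usage: print_status COLOR FILE CHARSET MESSAGE
print_status() {
    # %b interprets the backslash escapes stored in the color variables,
    # %s prints the file name and the other fields literally.
    printf '%b%s\t%s\t->\t%s%b\n' "$1" "$2" "$3" "$4" "$reset"
}

print_status "$success" "index.html" "iso-8859-1" "UTF-8"
print_status "$fileskipped" "logo.png" "binary" "Skipped"
```

Passing the file name through %s rather than through echo -e means that backslashes in a file name are printed as they are and not interpreted as escape sequences.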

Disclaimer

LEXO (we) does not take any responsibility or give any guarantee of success. The script works perfectly for our needs. Feel free to adapt it to yours. You may share and distribute this information.

If this was useful to you, Flattr us and share the link on Facebook! We'd very much appreciate it! 😉

Download the script

You can download our shell script here or copy and paste the code below into your own file (remember to give it the .sh file extension and make it executable; see usage above):

#!/bin/bash
# Created by LEXO, http://www.lexo.ch
# Version 1.0
#
# This bash script recursively converts all files within a given directory from any charset to UTF-8.
# It keeps track of those files that cannot be converted automatically. Usually this happens when the original charset
# cannot be recognized. In that case you should load the corresponding file into a development editor like Netbeans
# or Komodo and apply the UTF-8 charset manually.
#
# This is free software. Use and distribute but do it at your own risk.
# We will not take any responsibilities for failures and do not provide any support.
# Check parameters
if [ -z "$1" ] ; then
	echo "You did not supply any directory at the command line."
	echo "You need to provide the path to the directory that contains the files which you want to be converted"
	echo ""
	echo "Example: $0 /path/to/directory"
	echo ""
	echo "Important hint: You should not run this script from within the same directory where the files are stored"
	echo "that you want to convert right now."
	# Exit with a non-zero status so callers can detect the usage error
	exit 1
fi
# This array contains the file extensions that may be converted. Only files with one of these extensions are
# converted, even if their charset is detected as binary. Reason: Sometimes .htm(l), .php, .tpl files etc. are
# reported with a binary charset type. All other files are skipped, because converting real binary files
# (images, PDF documents etc.) into UTF-8 might destroy your data.
filestoconvert=(htm html php txt tpl asp css js)
# define colors
# default color
reset="\033[0;00m"
# Successful conversion (green)
success="\033[1;32m"
# No conversion needed (blue)
noconversion="\033[1;34m"
# file skipped because it's not mentioned in the filestoconvert array (white)
fileskipped="\033[1;37m"
# files that could not be converted aka error (red)
fileconverterror="\033[1;31m"
## Function to convert all files in a directory recursively
function convert {
# Clear the screen first
clear
dir=$1
fileerrors=""
# Read the recursive file list via process substitution instead of a pipe, so that the loop
# runs in the current shell and $fileerrors is still set after the loop has finished.
while IFS= read -r inputfile
do
        if [ -f "$inputfile" ] ; then
                # file -bi prints e.g. "text/html; charset=iso-8859-1"; keep the part after "="
                charset="$(file -bi "$inputfile"|awk -F "=" '{print $2}')"
                if [ "$charset" != "utf-8" ]; then
                        filename=$(basename "$inputfile")
                        extension="${filename##*.}"
                        # Files whose extension is not listed in the filestoconvert array are skipped (no conversion occurs)
                        if in_array "$extension" "${filestoconvert[@]}" ; then
                                # Create a tempfile and remember owner, group and permissions so they can be re-applied after conversion
                                tmp=$(mktemp)
                                owner=$( stat --format=%u "$inputfile" )
                                group=$( stat --format=%g "$inputfile" )
                                octalpermission=$( stat --format=%a "$inputfile" )
                                if iconv -f "$charset" -t UTF-8 -o "$tmp" "$inputfile" 2>/dev/null ; then
                                        echo -e "$success $inputfile\t$charset\t->\tUTF-8 $reset"
                                        mv "$tmp" "$inputfile"
                                        # Re-apply previous file permissions as well as user and group settings
                                        chown "$owner:$group" "$inputfile"
                                        chmod "$octalpermission" "$inputfile"
                                else
                                        # iconv failed: leave the original file untouched and report it at the end of the conversion process
                                        rm -f "$tmp"
                                        fileerrors="$fileerrors\n$inputfile"
                                fi
                        else
                                echo -e "$fileskipped $inputfile\t$charset\t->\tSkipped because its extension (.$extension) is not listed in the 'filestoconvert' array. $reset"
                        fi
                else
                        echo -e "$noconversion $inputfile\t$charset\t->\tNo conversion needed (file is already UTF-8) $reset"
                fi
        fi
done < <(find "$dir" -type f)
echo -e "$success Done! $reset"
echo -e ""
echo -e ""
if [ -n "$fileerrors" ]; then
	echo -e "The following files had errors (origin charset not recognized) and need to be converted manually (e.g. by opening the file in an editor (IDE) like Komodo or Netbeans):"
	echo -e "$fileconverterror$fileerrors$reset"
fi
exit 0
} #end function convert()
# Check if a value exists in an array
# @param $1 mixed  Needle
# @param $2 array  Haystack
# @return  Success (0) if value exists, Failure (1) otherwise
# Usage: in_array "$needle" "${haystack[@]}"
in_array() {
    local needle=$1
    local hay
    shift
    # "for hay" without a list iterates over the remaining positional parameters
    for hay; do
        # Quote the right-hand side so it is compared literally and not as a glob pattern
        [[ $hay == "$needle" ]] && return 0
    done
    return 1
} #end function in_array
#start conversion
convert "$1"
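
For files that end up in the error list, one way to check them by hand is to let iconv read them as UTF-8: it fails on the first invalid byte sequence. The following is a sketch of such a check, not part of the script above. The function name is_valid_utf8 and the two sample files are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Demo on two throwaway files in a temporary directory.
workdir=$(mktemp -d)
printf 'caf\303\251\n' > "$workdir/good.txt"   # "café" encoded as UTF-8
printf 'caf\351\n'     > "$workdir/bad.txt"    # "café" encoded as ISO-8859-1

for f in "$workdir/good.txt" "$workdir/bad.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not valid UTF-8"
    fi
done

rm -rf "$workdir"
```

Note that a pure ASCII file also passes this check, because ASCII is a subset of UTF-8.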