Wednesday, February 24, 2016

Dangerous String.format()

Introduction


The static method format(), added to class java.lang.String in Java 5, has become a popular and widely used replacement for MessageFormat, string concatenation, and verbose chains of StringBuilder.append() calls.

However, when using this method we should remember that this lunch is not free.

Performance issues

  1. The method takes a variable-length argument list (varargs) and therefore creates a new Object array on every call to wrap the passed arguments. That is one extra object to allocate and one extra object for the GC to collect. 
  2. Internally it creates an instance of java.util.Formatter, which parses the format specification. That is yet another object plus a lot of CPU-intensive parsing. 
  3. It creates a new StringBuilder instance to hold the formatted data.
  4. At the end it calls StringBuilder.toString() and therefore creates yet another object. Note that, at least in OpenJDK, toString() also copies the builder's character data into the new String rather than sharing the internal array. 
So a single call to String.format() creates at least 4 short-lived objects and parses the format specification. A real application probably parses the same format millions of times. 
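
The steps listed above can be sketched as code. This is only an approximation of what String.format() does, based on my reading of the OpenJDK sources; the class and method names are hypothetical, and other JDK versions may implement it differently.

```java
import java.util.Formatter;

class FormatEquivalent {
    // Hypothetical helper that mimics String.format(String, Object...).
    // The varargs call site already allocates an Object[] for 'args'.
    static String formatLikeStringFormat(String format, Object... args) {
        // new Formatter() allocates a StringBuilder as its destination.
        Formatter formatter = new Formatter();
        // The format specification is parsed again on every call.
        formatter.format(format, args);
        // toString() creates the resulting String from the StringBuilder.
        return formatter.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLikeStringFormat("%d items for %s", 3, "Bob"));
    }
}
```

Seen this way, the Object array, the Formatter, the StringBuilder and the resulting String are the four short-lived objects counted above.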

Solution

Use Formatter directly. Compare the following code snippets:



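// Both snippets assume "import java.util.Formatter;" and a static int field n (the iteration count).

// Variant 1: every String.format() call creates its own Formatter and an intermediate String.
public static CharSequence str() {
 StringBuilder buf = new StringBuilder();
 for (int i = 0; i < n; i++) {
  buf.append(String.format("%d\n", 1));
 }
 return buf;
}

// Variant 2: a single Formatter appends directly to the StringBuilder.
public static CharSequence fmt() {
 StringBuilder buf = new StringBuilder();
 Formatter fmt = new Formatter(buf);
 for (int i = 0; i < n; i++) {
  fmt.format("%d\n", 1);
 }
 return buf;
}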


Method fmt() is about 1.5 times faster than method str(). The gain may be even larger when the Formatter writes directly to a stream instead of first creating a String and then writing it to the stream. 
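
Writing directly to the destination might look like the following minimal sketch. The class DirectFormat and the method writeRows() are hypothetical names used only for illustration; the Formatter(Appendable, Locale) constructor is part of the standard API.

```java
import java.util.Formatter;
import java.util.Locale;

class DirectFormat {
    // Hypothetical helper: writes the numbers 0..rows-1, one per line, straight into 'out'.
    static void writeRows(Appendable out, int rows) {
        // The Formatter appends to 'out' directly instead of returning a String per row.
        Formatter fmt = new Formatter(out, Locale.US);
        for (int i = 0; i < rows; i++) {
            fmt.format("%d\n", i);
        }
        fmt.flush(); // flushes 'out' if it is Flushable (e.g. a PrintStream)
    }

    public static void main(String[] args) {
        // System.out is a PrintStream, which implements Appendable.
        writeRows(System.out, 3);
    }
}
```

The Formatter is deliberately not closed here, because closing it would also close the underlying stream (System.out in this example).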


String.format() is Locale sensitive

There are 2 format() methods:

public static String format(String format, Object... args)

and 

public static String format(Locale l, String format, Object... args)


The method that does not receive a Locale argument uses the default locale, Locale.getDefault(Locale.Category.FORMAT), which depends on the machine configuration. This means that changing machine settings changes the behavior of your application and may even break it. The most common problems are:
  • decimal separator
  • digits

Decimal separator


Programmers are so used to the decimal separator being a dot (.) that they sometimes forget that it depends on the locale. I've written a simple code snippet that iterates over all available locales and checks which character is used as the decimal separator:

Decimal separator    Number of locales
Dot (.)              71
Comma (,)            89
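
A count like this can be produced along the following lines. This is a sketch, not the original snippet: the class and method names are hypothetical, and it assumes that DecimalFormatSymbols reports the same separator the formatter uses. The exact numbers depend on the JDK version and the locales it ships with.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class DecimalSeparators {
    // Hypothetical helper: counts how many available locales use each decimal separator.
    static Map<Character, Integer> countSeparators() {
        Map<Character, Integer> counts = new TreeMap<>();
        for (Locale locale : Locale.getAvailableLocales()) {
            char separator = DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
            counts.merge(separator, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one line per separator character together with its locale count.
        countSeparators().forEach((separator, count) ->
                System.out.println("'" + separator + "' " + count));
    }
}
```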

If the produced string is later parsed, the parsing may break when the default locale of the machine changes. 
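
Here is a small sketch of such a breakage. The class and method names are hypothetical; the point is that Double.parseDouble() always expects a dot, whatever locale produced the text.

```java
import java.util.Locale;

class LocaleRoundTrip {
    // Hypothetical helper: formats 1.5 with the given locale and tries to parse it back.
    static boolean roundTrips(Locale locale) {
        String text = String.format(locale, "%.2f", 1.5);
        try {
            // Double.parseDouble() is not locale-aware: it accepts only a dot as separator.
            return Double.parseDouble(text) == 1.5;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("US: " + roundTrips(Locale.US));           // text is "1.50"
        System.out.println("GERMANY: " + roundTrips(Locale.GERMANY)); // text is "1,50"
    }
}
```

With Locale.US the value survives the round trip; with Locale.GERMANY the comma makes the parse fail.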

Digits

Everyone knows that the digits are 1, 2, 3, ... This is right, but not in every locale. Arabic, Hindi, Thai and other languages use other characters to represent the same digits. Here is a code sample:




// Prints every locale that does not format 1 as the ASCII digit "1" (requires java.util.Locale).
for (Locale locale : Locale.getAvailableLocales()) {
 String one = String.format(locale, "%d", 1);
 if (!"1".equals(one)) {
  System.out.println("\t" + locale + ": " + one);
 }
}


This is its output when run on a Linux machine with Java 8:

        hi_IN: १
        th_TH_TH_#u-nu-thai: ๑


When executed on Android, this code produces 109 lines of output, including:

  1. all versions of Arabic locales, 
  2. as, bn, dz, fa, ks, mr, my, ne, pa, ps, uz with dialects.
This may easily break an application in some locales. 



Conclusions

  1. Since Java formatting is locale-dependent, it should be used very carefully. In some cases it is probably better to specify the locale explicitly, e.g. Locale.US.
  2. Be careful when calling String.format() in performance-critical sections of code. Using another API (e.g. invoking class Formatter directly) may significantly improve performance. 
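
As an illustration of the first point, here is a minimal sketch that pins the locale for output meant to be read by machines. The class StableFormat and the method machineReadable() are hypothetical names; the Thai locale tag is used only to simulate a machine with a different default locale.

```java
import java.util.Locale;

class StableFormat {
    // Hypothetical helper: formats values with a fixed locale, independent of machine settings.
    static String machineReadable(int id, double value) {
        return String.format(Locale.US, "%d;%.2f", id, value);
    }

    public static void main(String[] args) {
        Locale original = Locale.getDefault();
        try {
            // Simulate a machine whose default locale uses Thai digits.
            Locale.setDefault(Locale.forLanguageTag("th-TH-u-nu-thai"));
            // Result depends on the default locale.
            System.out.println(String.format("%d;%.2f", 42, 1.5));
            // Result uses ASCII digits and a dot regardless of the default locale.
            System.out.println(machineReadable(42, 1.5));
        } finally {
            Locale.setDefault(original); // restore the original default locale
        }
    }
}
```

Text shown to end users is a different case: there the default locale is usually what you want.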

Acknowledgements

I'd like to thank Eliav Atoun, who inspired the discussion about this issue and helped me try the code sample on Android. 

Source code 

The code snippets used here can be found on GitHub.