java performanceUsing the Java Language to Push Bits Around![]()
by Glen McCluskey
Glen McCluskey is a consultant with 15 years of experience and has focused on program-ming languages since 1988. He specializes in Java and C++ per-formance, testing, and technical documentation areas.
The Java programming language is often cited as a good language for constructing user interfaces and Web applets, multimedia applications, and so on. It has many features that support this style of usage. But how well does the language do at some tasks traditionally handled by C and C++? For example, what if you want to do some low-level bit manipulation? Do you pay a performance penalty? To partially answer this question, we will look at an example of data compression, one that uses a special type of encoding to compress sequences of small integers. We'll look at the Java source code for this compression algorithm and compare its performance to that of C++.
Rice Coding
One technique that's been found useful in creating such lists is to store deltas between numbers instead of the numbers themselves, for example: 1, 6, 16, 16, 15, 25 So the first file number (1) has a delta of 1 from the base (0), the second file number (7) has a delta of 6 from the first, and so on. The reason deltas are used is that it's possible to compress lists of deltas efficiently, whereas compressing lists of large file numbers is more difficult. Common words in an index will typically have small deltas, and uncommon words often occur in clusters together, with small gaps between the occurrences. One type of compression that's used in this situation is known as "Rice coding," which is a variation on another technique called "Golomb coding." Rice coding works as follows: for a positive value x to be encoded along with a parameter m: 1. Let:
b = 2^m
Emit q 1 bits followed by a 0 bit. 2. Emit the lower m bits of x-1, treating x-1 as a binary value. For example, if m = 2 and x = 7, the coding is: 10 10 If m = 0 and x = 1, the coding is: 0 If m = 1 and x = 10, the coding is: 11110 1 If m = 3 and x = 10, then the coding is: 10 001 Different values of m are chosen depending on what range of values for x is expected; a typical value for m is a small integer like 5. The value of m can be chosen to minimize the encoded length for a given distribution of numbers. For example, suppose that 95% of the time you expect to encounter numbers in the range 125, but 5% of the time the range will be 11000. In this case, m = 4 is optimal, and the average number of bits is 6.3 per number. In other words, you can encode 1,000 numbers in about 6,300 bits. The demo program below derives this value empirically. One simple way of understanding why this coding algorithm works is to view the output values in two parts. The second part emits the bits that make up most of the number, for example, 5 bits if m = 5. The first part of the output is an "escape" or "overflow." If the distribution of numbers is mostly in a narrow range, then this type of coding is very efficient. Decoding simply reverses the process, to get a sequence of numbers from a stream of bits.
Implementation of the Rice Coding Algorithm
This code can be downloaded at <ftp://ftp.glenmccl.com/pub/free/rice.zip>.
public class RiceCode {
private void addbit(int b) {
// need to grow the bit list
if (nbits == len * 8) {
}
} // get the value of the n-th bit
private int getbit(int n) {
} // construct a new encoding object
public RiceCode(int mval) {
}
// construct a new object from a previously
public RiceCode(byte vec[]) {
} // get the number of items
public int getSize() {
} // get the number of bits
public int getNumBits() {
}
// add an item to the encoding
int x = val - 1;
// encode the first (unary) part
while (q-- > 0)
// encode the binary part
if (m > 0) {
} } } // get the items in this encoding and return them in a vector
public int[] getItems() {
}
}
} // export the encoding into a vector of bytes
public byte[] getBits() {
} // driver
public static void main(String args[]) {
// add 1000 numbers, with 5% of them in the
for (int j = 1; j <= 1000; j++) {
} // print out the number of bits used System.out.println(i + " " + rc.getNumBits()); } } }
Performance Compared to C++
public static void main(String args[]) {
} } } Using Borland C++ 5.4 as a C++ compiler, and Sun's JDK 1.2.2 with just-in-time compilation for the Java compiler, execution times in seconds on a 300MHz Pentium are like this:
Java 20.5
The difference here is negligible.
The Java language does not primarily target systems programming, and in
fact there are some things that are impossible to do with the language,
such as low-level memory manipulation using specific virtual-memory
addresses. But for many other programming tasks, these limitations
don't matter. We've illustrated one such task.
|
![]() Last changed: 20 Jul. 2000 mc |
|