Java Deflater for a large set of random strings

I am using the Deflater class to try to compress a large set of random strings. My compress and decompress methods look like this:

public static String compressAndEncodeBase64(String text) {
    try {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // Closing the DeflaterOutputStream writes the final deflate block into os.
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            // Explicit charset, so the bytes match what decompressB64 decodes as UTF-8.
            dos.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(os.toByteArray());
    } catch (Exception e) {
        log.info("Caught exception when trying to compress {}: ", text, e);
    }
    return null;
}

public static String decompressB64(String compressedAndEncodedText) {
    try {
        // Undo the Base64 step first, then inflate the raw bytes.
        byte[] decodedText = Base64.getDecoder().decode(compressedAndEncodedText);

        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // InflaterOutputStream decompresses whatever is written to it into os.
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(decodedText);
        }
        byte[] decompressedBArray = os.toByteArray();
        return new String(decompressedBArray, StandardCharsets.UTF_8);
    } catch (Exception e) {
        log.error("Caught following exception when trying to decode and decompress text {}: ", compressedAndEncodedText, e);
        throw new BadRequestException(Constants.ErrorMessages.COMPRESSED_GROUPS_HEADER_ERROR);
    }
}

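However, when I test this on a large set of random strings, my "compressed" strings are larger than the originals. Even for relatively small random strings, the compressed data is longer. For example, this unit test fails: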

@Test
public void testCompressDecompressRandomString() {
    // 71 random alphanumeric characters (letters and digits).
    String orig = RandomStringUtils.random(71, true, true);
    String compressedString = compressAndEncodeBase64(orig);
    Assertions.assertTrue(orig.length() > compressedString.length(),
            "The original string has length " + orig.length() + ", while compressed string has length " + compressedString.length());
}

Can someone explain what is happening, and suggest a possible alternative?

Note: I also tried using the Deflater without Base64 encoding:

public static String compress(String data) {
    Deflater deflater = new Deflater();
    deflater.setInput(data.getBytes(StandardCharsets.UTF_8));
    deflater.finish();
    // Drain the deflater in a loop so inputs that need more than one buffer still work.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    while (!deflater.finished()) {
        int n = deflater.deflate(buffer);
        out.write(buffer, 0, n);
    }
    deflater.end(); // release native resources
    byte[] compressed = out.toByteArray();
    log.info("The Original String: " + data + "\n Size: " + data.length());
    log.info("The Compressed Output Size: " + compressed.length);
    // Compressed bytes are not valid UTF-8 text; ISO-8859-1 maps each byte to one char without loss.
    return new String(compressed, StandardCharsets.ISO_8859_1);
}

But my test still fails.

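First, you won't get much or any compression on short strings. A compressor needs more data, both to gather statistics about the data and to have previous data in which to look for repeated strings.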
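Second, if you are testing with random data, you handicap the compressor further, since now there are no repeated strings. For your test case, random alphanumeric strings, the only compression you can get comes from the fact that each byte takes only 62 possible values. That allows compression by a factor of log(62)/log(256) = 0.744. Even then, you need enough input to amortize the overhead of the code description. Your 71-character test case will always be compressed by deflate to 73 bytes, which is essentially just copying the data with a small overhead. There isn't enough input to justify a code description that takes advantage of the limited character set. If I have 1,000,000 random characters, deflate can compress them to about 752,000 bytes.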
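The figures in this paragraph can be checked with a small experiment. The following is only a sketch: the class name, helper names, and the fixed seed are made up for illustration, and it uses `java.util.Random` rather than `RandomStringUtils`. Note that `new Deflater()` emits the zlib wrapper (a 2-byte header and a 4-byte checksum), so the sizes it reports should come out a few bytes larger than raw deflate figures.

```java
import java.util.Random;
import java.util.zip.Deflater;

class RandomDeflateDemo {
    // The 62-symbol alphabet of letters and digits.
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    // Best possible output/input ratio when each byte is one of `symbols`
    // equally likely values: log(symbols) / log(256).
    static double entropyRatio(int symbols) {
        return Math.log(symbols) / Math.log(256);
    }

    // n random alphanumeric bytes; the seed only makes runs repeatable.
    static byte[] randomAlphanumeric(int n, long seed) {
        Random rnd = new Random(seed);
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            out[i] = (byte) ALPHABET.charAt(rnd.nextInt(ALPHABET.length()));
        }
        return out;
    }

    // Number of bytes produced by deflate at the default level (zlib format).
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("entropy bound: %.3f%n", entropyRatio(62));
        for (int n : new int[] {71, 1_000_000}) {
            int size = deflatedSize(randomAlphanumeric(n, 42L));
            System.out.printf("%d chars -> %d bytes (ratio %.3f)%n",
                    n, size, (double) size / n);
        }
    }
}
```

Running `main` prints the theoretical bound next to the measured ratios, which makes it easy to see how far a short input is from the bound compared with a long one.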
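Third, by Base64-encoding the result you *expand* the compressed data by a factor of 1.333. So if I compress by a factor of 0.752 and then expand by 1.333, the net result is an *expansion* by 1.002! You will get nowhere with random characters from a set of 62, no matter how long the input is.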
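To see where the 1.333 comes from: Base64 turns every 3 bytes into 4 characters, padding the last group, so n bytes become 4 * ceil(n / 3) characters. A minimal sketch using the standard `java.util.Base64` encoder (the class and method names here are illustrative only):

```java
import java.util.Base64;

class Base64ExpansionDemo {
    // Length of the padded Base64 text for rawBytes bytes of input.
    // The content of the bytes does not matter, only their count.
    static int encodedLength(int rawBytes) {
        return Base64.getEncoder().encodeToString(new byte[rawBytes]).length();
    }

    public static void main(String[] args) {
        // 73 bytes is the deflate output size quoted for the 71-character test;
        // 752,000 bytes is the size quoted for 1,000,000 random characters.
        for (int n : new int[] {3, 73, 752_000}) {
            int len = encodedLength(n);
            System.out.printf("%d bytes -> %d Base64 chars (x%.3f)%n",
                    n, len, (double) len / n);
        }
    }
}
```

For the 71-character test this means the 73 compressed bytes turn into 100 Base64 characters, so the assertion in the question cannot pass.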
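Given all that, you need to test on real-world input. I suspect your application does not deal in randomly generated data. Don't try to compress short strings. Combine your strings into longer inputs so the compressor has something to work with. If you must encode with Base64, then you must. But expect that you may get expansion instead of compression. You can include in your output format an option for a chunk to be either compressed or not, indicated by a leading byte. Then when compressing, if the result is not smaller, send the data uncompressed instead. You could also try a more efficient encoding, such as Base85, or whatever number of characters you can transmit transparently.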
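The leading-byte idea could look roughly like the sketch below. The class name, the flag values 0 and 1, and the choice to throw `IllegalArgumentException` on bad input are all assumptions made for illustration, not part of any existing format. The encoder tries deflate and falls back to storing the input verbatim whenever compression does not make it smaller, so the worst case costs exactly one extra byte.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlaggedCodec {
    // Hypothetical flag values for the leading byte.
    static final byte STORED = 0;
    static final byte DEFLATED = 1;

    // Returns flag byte + payload, choosing whichever payload is smaller.
    static byte[] encode(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(DEFLATED);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        if (out.size() - 1 < input.length) {
            return out.toByteArray();          // compression helped
        }
        byte[] stored = new byte[input.length + 1];
        stored[0] = STORED;                    // compression did not help
        System.arraycopy(input, 0, stored, 1, input.length);
        return stored;
    }

    // Reverses encode(); rejects unknown flags and corrupt or truncated data.
    static byte[] decode(byte[] framed) {
        if (framed.length == 0) throw new IllegalArgumentException("empty frame");
        if (framed[0] == STORED) return Arrays.copyOfRange(framed, 1, framed.length);
        if (framed[0] != DEFLATED) throw new IllegalArgumentException("unknown flag byte");
        Inflater inflater = new Inflater();
        inflater.setInput(framed, 1, framed.length - 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated deflate data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

If the result still has to travel as text, the framed bytes can be Base64-encoded afterwards; the flag byte then guarantees that an incompressible input only pays for the encoding, not for a failed compression attempt as well.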
