HDFS file encoding converter

Posted by vwoqyblh on 2021-05-24 in Spark

I am trying to convert an HDFS file from UTF-8 to ISO-8859-1.
I wrote a small Java program:

String theInputFileName="my-utf8-input-file.csv";
String theOutputFileName="my-iso8859-output-file.csv";
Charset inputCharset = StandardCharsets.UTF_8;
Charset outputCharset = StandardCharsets.ISO_8859_1;

try (
    final FSDataInputStream in = theFileSystem.open(new Path(theInputFileName)) ;
    final FSDataOutputStream out = theFileSystem.create(new Path(theOutputFileName))
)        
{
    try (final BufferedReader reader = new BufferedReader(new InputStreamReader(in, inputCharset)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            // use the local outputCharset declared above
            out.write(line.getBytes(outputCharset));
            // lineSeparator is a field of the enclosing class (not shown)
            out.write(this.lineSeparator.getBytes(outputCharset));
        }
    }
} catch (IllegalArgumentException | IOException e)
{
    RddFileWriter.LOGGER.error(e, "Exception on file '%s'", theOutputFileName);
}

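This code runs on a Hadoop cluster through Spark (the output data is normally provided by an RDD).
To simplify my question, I removed the RDD/Dataset part and work directly on HDFS files.
When I run the code:
- On my developer machine: it works! The local output file is encoded in ISO-8859-1.
- On the edge server, via the spark-submit command, using HDFS files: it works! The HDFS output file is encoded in ISO-8859-1.
- On a datanode, via Oozie: it does not work :-( The HDFS output file is encoded in UTF-8 instead of ISO-8859-1.
I don't understand which property (or something else) causes this change in behavior.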
Versions:
Hadoop: v2.7.3
Spark: v2.2.0
Java: 1.8
Looking forward to your help. Thanks in advance.

Java hadoop hdfs apache-spark Encoding

Source: https://stackoverflow.com/questions/64229496/hdfs-file-encoding-converter

1 Answer

wecizke31#

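In the end, I found the root of the problem.
The input file on the cluster was corrupted: the file as a whole did not have a constant, consistent encoding.
External data is aggregated daily, and the encoding was recently changed from ISO to UTF-8 without notice...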
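
As a rough illustration of how such a mixed file could be spotted, the sketch below strictly decodes a line's raw bytes as UTF-8 and reports whether they are well formed. The class `Utf8Check` and the method `isValidUtf8` are hypothetical names, not part of the original post, and the check assumes you have the raw bytes of each line (not a `String` already decoded by a reader). It only catches ISO-8859-1 bytes sitting in a UTF-8 file; text that was converted twice is still valid UTF-8 and passes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from the original post.
final class Utf8Check {

    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // substituting U+FFFD for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", escaped to be independent of source encoding
        byte[] utf8Line = text.getBytes(StandardCharsets.UTF_8);
        byte[] isoLine = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8Line)); // true
        System.out.println(isValidUtf8(isoLine));  // false: a lone 0xE9 is not valid UTF-8
    }
}
```

Running such a check line by line would show roughly where the encoding switches, which is where a file like this would need to be split.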
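Simply put:
- the beginning contained badly converted characters, « Ã© Ãª Ã¨ » instead of « é ê è »
- the end was correctly encoded
We split the data, fixed the encoding, and merged it again to repair the input.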
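
The post does not show how the encoding was fixed. The sketch below only illustrates the general mechanism: decoding the UTF-8 bytes of « é ê è » as ISO-8859-1 yields « Ã© Ãª Ã¨ », and the reverse round trip restores the text. `MojibakeDemo`, `misdecode` and `repair` are hypothetical names, and the repair assumes the damage came from exactly one UTF-8 to ISO-8859-1 misreading (not, for example, from windows-1252).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; an assumption about how the damage arose,
// not the author's actual repair procedure.
final class MojibakeDemo {

    /** Reads the UTF-8 bytes of the text as if they were ISO-8859-1 (the faulty conversion). */
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses misdecode: re-encode as ISO-8859-1, then decode as UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9 \u00ea \u00e8";                   // é ê è
        String expectedGarbled = "\u00c3\u00a9 \u00c3\u00aa \u00c3\u00a8"; // Ã© Ãª Ã¨

        String garbled = misdecode(original);
        System.out.println(garbled.equals(expectedGarbled));  // true
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

Applying `repair` to text that is already correct would damage it, so a fix of this kind should be limited to the part of the file known to be garbled.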
The final code works fine.

private void changeEncoding(
            final Path thePathInputFileName,final Path thePathOutputFileName,
            final Charset theInputCharset,  final Charset theOutputCharset,
            final String theLineSeparator
        ) {
    try (
        final FSDataInputStream in = this.fileSystem.open(thePathInputFileName);
        final FSDataOutputStream out = this.fileSystem.create(thePathOutputFileName);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, theInputCharset));
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, theOutputCharset));) {

        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(theLineSeparator);
        }

    } catch (IllegalArgumentException | IOException e) {
        LOGGER.error(e, "Exception on file '%s'", thePathOutputFileName);
    }
}
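
One caveat with this approach: an `OutputStreamWriter` built from a `Charset` silently replaces characters that the target charset cannot represent (with `?` for ISO-8859-1). The sketch below is a possible stricter variant, not the author's code. It uses plain `java.io` streams so it can run without Hadoop, and it passes a `CharsetEncoder` configured with `CodingErrorAction.REPORT` so that unmappable characters raise an exception. `StrictTranscoder` and `transcode` are hypothetical names. Since `FSDataInputStream` and `FSDataOutputStream` are `InputStream` and `OutputStream` subclasses, the same method should accept them as well.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical variant for illustration; not the code from the answer.
final class StrictTranscoder {

    /** Copies in to out line by line, failing on characters the target charset cannot encode. */
    static void transcode(InputStream in, OutputStream out,
                          Charset from, Charset to, String lineSeparator) throws IOException {
        // REPORT turns unmappable characters into a CharacterCodingException
        // (an IOException) instead of a silent '?' substitution.
        CharsetEncoder strict = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, from));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, strict))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write(lineSeparator);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "caf\u00e9\n".getBytes(StandardCharsets.UTF_8); // "café" plus newline
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(input), out,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, "\n");
        System.out.println(out.size()); // 5: one byte per character in ISO-8859-1
    }
}
```

Because the reader substitutes U+FFFD for malformed input and U+FFFD has no ISO-8859-1 representation, a strict encoder like this would also fail on input bytes that are not valid in the source charset, which may help surface a corrupted input file early.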

Stop your research! ;-)

Answered 2021-05-24
