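I wrote the following programs in different programming languages to count the lines of a file, and the output differs depending on the program. Oddly, some of the programs agree with each other. I tested with a 6 GB UTF-8 XML file that has roughly 146 million lines.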
# Python
# Output -> 146114085 lines
import time
lines = 0
start = time.perf_counter()
with open('file_path') as myfile:
    for line in myfile:
        lines += 1
print("{} lines".format(lines))
end = time.perf_counter()
elapsed = end - start
print(f'Elapsed time: {elapsed:.3f} seconds')
// Java
// Output -> 146114085 lines (just as with python)
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
public class Main {
    public static void main(String[] args) {
        try {
            long startTime = System.currentTimeMillis();
            int BUFFER_SIZE = 1024*1024;
            String filePath = "file_path";
            FileReader file = new FileReader(filePath);
            BufferedReader reader = new BufferedReader(file, BUFFER_SIZE);
            long lines = reader.lines().count();
            reader.close();
            System.out.println("The number of lines is " + lines);
            long elapsedTime = System.currentTimeMillis() - startTime;
            System.out.println("Duration in seconds: " + elapsedTime/1000);
        } catch (FileNotFoundException e) {
            throw new RuntimeException(e);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
// Rust
// Output -> 146113746 lines
use std::fs::File;
use std::io::{BufRead, BufReader, Error};
use std::time::Instant;
fn main() {
    let file_path = "file_path";
    let buffer_size = 1024*1024;
    let start = Instant::now();
    if let Err(err) = read_file(buffer_size, file_path) {
        println!("{}", err);
    }
    let duration = start.elapsed();
    println!("The function took {} seconds to execute", duration.as_secs());
}
fn read_file(buffer_size: usize, file_path: &str) -> Result<(), Error> {
    let file = File::open(file_path)?;
    let reader = BufReader::with_capacity(buffer_size, file);
    let lines = reader.lines().fold(0, |sum, _| sum + 1);
    println!("Number of lines {}", lines);
    Ok(())
}
// C
// Output -> 146113745 lines (one line less than rust output)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(int argc, char *argv[]) {
    // start time
    clock_t start = clock();
    // File path
    const char* file_path = "file_path";
    // Open the file for reading
    FILE *fp = fopen(file_path, "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    // Allocate a buffer to hold the data
    const size_t BUFFER_SIZE = 1024*1024;
    char *buffer = malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        fclose(fp);
        return 1;
    }
    // Declare the number of lines variable
    unsigned int lines = 0;
    // Read the data in chunks until fread returns 0 (end of file or error)
    size_t bytes_read;
    while ((bytes_read = fread(buffer, 1, BUFFER_SIZE, fp)) > 0) {
        // Count the '\n' bytes in this chunk
        for (size_t i = 0; i < bytes_read; i++) {
            if (buffer[i] == '\n') {
                lines++;
            }
        }
    }
    printf("The number of lines %u\n", lines);
    // Clean up
    free(buffer);
    fclose(fp);
    // End
    clock_t end = clock();
    // Calculate the elapsed CPU time in seconds (cast before dividing to keep the fraction)
    double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Elapsed time: %f seconds\n", elapsed);
    return 0;
}
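Finally, the wc command. Output -> 146113745 lines (same as C):
wc -l file_path
I believe the correct answer is Rust's: it is one more than wc/C, and the extra line is the last one, which has no line terminator before the end of the file. The cases that confuse me are Java and Python.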
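The off-by-one between wc/C and Rust can be reproduced on a tiny in-memory sample. This is only a minimal Python sketch: the helper names `count_newline_bytes` and `count_lines_with_tail` are made up for illustration, and the sample data stands in for the real file.

```python
def count_newline_bytes(data: bytes) -> int:
    """Count newline bytes only, as wc -l and the C loop do."""
    return data.count(b"\n")


def count_lines_with_tail(data: bytes) -> int:
    """Count newline bytes, plus one if the data ends with an unterminated line."""
    tail = 1 if data and not data.endswith(b"\n") else 0
    return data.count(b"\n") + tail


terminated = b"a\nb\nc\n"    # last line ends with a newline
unterminated = b"a\nb\nc"    # last line has no newline

print(count_newline_bytes(terminated), count_lines_with_tail(terminated))      # 3 3
print(count_newline_bytes(unterminated), count_lines_with_tail(unterminated))  # 2 3
```

The two counts only disagree when the final line is not newline-terminated, which would be consistent with the single-line gap reported above.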
1 Answer
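The definition of a line I am using is
.*?\\n|.+
which works on https://regexr.com/. For some reason, the file-reading implementations I used in Python and Java interpret the character '\r' as a line break. This does not happen in the Rust implementation, nor in wc, nor (obviously, since the check is explicit) in the implementation I wrote in C. However, if I change the condition (buffer[i] == '\n')
to ((buffer[i] == '\n') || (buffer[i] == '\r'))
I get the same value as in Python and Java, minus 1.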
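The likely cause is documented behaviour: Python's text mode defaults to universal newlines (`newline=None`), so '\r', '\n' and '\r\n' all end a line, and Java's `BufferedReader` line reading recognises the same three terminators. The sketch below reproduces the pattern on a small in-memory sample. The helper names and the sample bytes are assumptions for illustration, and `io.TextIOWrapper` stands in for `open()`, which accepts the same `newline` parameter.

```python
import io


def count_text_lines(data: bytes, newline) -> int:
    """Count lines the way Python text mode does for a given newline setting.

    newline=None enables universal newlines; newline="\\n" splits on "\\n" only.
    """
    reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline=newline)
    return sum(1 for _ in reader)


def count_bytes(data: bytes, targets) -> int:
    """Count occurrences of each terminator byte, like the C loop."""
    return sum(data.count(t) for t in targets)


# Hypothetical sample: two lone '\r', two '\n', and no terminator on the last line.
SAMPLE = b"a\rb\rc\nd\ne"

print(count_text_lines(SAMPLE, None))        # 5, like Python/Java
print(count_text_lines(SAMPLE, "\n"))        # 3, like Rust
print(count_bytes(SAMPLE, [b"\n"]))          # 2, like wc -l and the C code
print(count_bytes(SAMPLE, [b"\n", b"\r"]))   # 4, the Python/Java value minus 1
```

If the goal is for Python to agree with Rust, passing `newline="\n"` to `open()` is one option. Note that counting '\r' and '\n' bytes separately would count a '\r\n' pair twice, so the byte-level variant only matches when the carriage returns in the file are lone ones.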