Is there a faster way to scan through a directory recursively in .NET?

Posted by laximzn5 on 2023-10-21 in .NET


I am writing a directory scanner in .NET.
For each file/directory I need the following information.

class Info {
        public bool IsDirectory;
        public string Path;
        public DateTime ModifiedDate;
        public DateTime CreatedDate;
    }

I have this function:

static List<Info> RecursiveMovieFolderScan(string path){

        var info = new List<Info>();
        var dirInfo = new DirectoryInfo(path);
        foreach (var dir in dirInfo.GetDirectories()) {
            info.Add(new Info() {
                IsDirectory = true,
                CreatedDate = dir.CreationTimeUtc,
                ModifiedDate = dir.LastWriteTimeUtc,
                Path = dir.FullName
            });

            info.AddRange(RecursiveMovieFolderScan(dir.FullName));
        }

        foreach (var file in dirInfo.GetFiles()) {
            info.Add(new Info()
            {
                IsDirectory = false,
                CreatedDate = file.CreationTimeUtc,
                ModifiedDate = file.LastWriteTimeUtc,
                Path = file.FullName
            });
        }

        return info; 
    }

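It turns out this implementation is quite slow. Is there any way to speed it up? I am thinking of hand-coding this with FindFirstFileW, but I would like to avoid that if there is a faster built-in way.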

.net

Source: https://stackoverflow.com/questions/724148/is-there-a-faster-way-to-scan-through-a-directory-recursively-in-net

9 Answers


35g0bw711#

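Depending on how much time you are trying to shave off the function, it may be worth your while to call the Win32 API functions directly, since the existing API does a lot of extra processing to check things you may not be interested in.
If you haven't done so already, and assuming you don't intend to contribute to the Mono project, I would strongly recommend downloading Reflector and looking at how Microsoft implemented the API calls you are currently using. This will give you an idea of what you need to call and what you can leave out.
You might, for example, opt to create an iterator that yields directory names instead of a function that returns a list. That way you don't end up iterating over the same list of names two or three times through all the various levels of code.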
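The iterator idea can be sketched roughly as follows. This is only an illustration, not code from the answer: LazyScanner and Scan are hypothetical names, the Info class mirrors the one in the question, and the sketch assumes every directory is readable (no handling of access-denied errors or reparse points).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class Info
{
    public bool IsDirectory;
    public string Path;
    public DateTime ModifiedDate;
    public DateTime CreatedDate;
}

public static class LazyScanner
{
    // Hypothetical helper: walks the tree with an explicit stack and yields
    // each entry as it is found, so no intermediate lists are built or copied.
    public static IEnumerable<Info> Scan(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            var current = new DirectoryInfo(pending.Pop());

            foreach (FileSystemInfo entry in current.EnumerateFileSystemInfos())
            {
                bool isDir = (entry.Attributes & FileAttributes.Directory) != 0;
                if (isDir)
                    pending.Push(entry.FullName); // visit this directory later

                yield return new Info
                {
                    IsDirectory = isDir,
                    Path = entry.FullName,
                    CreatedDate = entry.CreationTimeUtc,
                    ModifiedDate = entry.LastWriteTimeUtc
                };
            }
        }
    }
}
```

A caller can then write foreach (var item in LazyScanner.Scan(path)) and stop early, or call ToList() only when a full list is really needed.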


drnojrws2#

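I just ran across this. Nice implementation of the native version.
This version, while still slower than the version that uses FindFirst and FindNext, is quite a bit faster than your original .NET version.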

static List<Info> RecursiveMovieFolderScan(string path)
    {
        var info = new List<Info>();
        var dirInfo = new DirectoryInfo(path);
        foreach (var entry in dirInfo.GetFileSystemInfos())
        {
            bool isDir = (entry.Attributes & FileAttributes.Directory) != 0;
            if (isDir)
            {
                info.AddRange(RecursiveMovieFolderScan(entry.FullName));
            }
            info.Add(new Info()
            {
                IsDirectory = isDir,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            });
        }
        return info;
    }

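It should produce the same output as your native version. My testing shows that this version takes about 1.7 times as long as the version that uses FindFirst and FindNext. Timings were obtained in release mode, running without the debugger attached.
Curiously, changing GetFileSystemInfos to EnumerateFileSystemInfos adds about 5% to the running time in my tests. I rather expected it to run at the same speed or possibly faster, because it doesn't have to create the array of FileSystemInfo objects.
The following code is shorter still, because it lets the Framework take care of the recursion. But it is a good 15% to 20% slower than the version above.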

static List<Info> RecursiveScan3(string path)
    {
        var info = new List<Info>();

        var dirInfo = new DirectoryInfo(path);
        foreach (var entry in dirInfo.EnumerateFileSystemInfos("*", SearchOption.AllDirectories))
        {
            info.Add(new Info()
            {
                IsDirectory = (entry.Attributes & FileAttributes.Directory) != 0,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            });
        }
        return info;
    }

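Again, if you change that to GetFileSystemInfos, it will be slightly (but only slightly) faster.
For my purposes, the first solution above is quite fast enough. The native version runs in about 1.6 seconds. The version that uses DirectoryInfo runs in about 2.9 seconds. I suppose if I were running these scans very frequently, I would change my mind.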


mzsu5hc03#

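"It's pretty shallow, 371 dirs with an average of 10 files in each directory. Some dirs contain other sub dirs."
This is just a comment, but your numbers do appear to be quite high. I ran the code below, using essentially the same recursive method you are using, and my times are far lower despite creating string output.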

public void RecurseTest(DirectoryInfo dirInfo, 
                            StringBuilder sb, 
                            int depth)
    {
        _dirCounter++;
        if (depth > _maxDepth)
            _maxDepth = depth;

        var array = dirInfo.GetFileSystemInfos();
        foreach (var item in array)
        {
            sb.Append(item.FullName);
            if (item is DirectoryInfo)
            {
                sb.Append(" (D)");
                sb.AppendLine();

                RecurseTest(item as DirectoryInfo, sb, depth+1);
            }
            else
            { _fileCounter++; }

            sb.AppendLine();
        }
    }

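I ran the above code on a number of different directories. On my machine the second call to scan a directory tree was usually faster, due to caching either by the runtime or by the file system. Note that this system isn't anything too special, just a one-year-old development workstation.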

// cached call
Dirs = 150, files = 420, max depth = 5
Time taken = 53 milliseconds

// cached call
Dirs = 1117, files = 9076, max depth = 11
Time taken = 433 milliseconds

// first call
Dirs = 1052, files = 5903, max depth = 12
Time taken = 11921 milliseconds

// first call
Dirs = 793, files = 10748, max depth = 10
Time taken = 5433 milliseconds (2nd run 363 milliseconds)

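Realizing that I wasn't getting the created and modified dates, I modified the code to fetch those as well, with the following times.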

// now grabbing last update and creation time.
Dirs = 150, files = 420, max depth = 5
Time taken = 103 milliseconds (2nd run 93 milliseconds)

Dirs = 1117, files = 9076, max depth = 11
Time taken = 992 milliseconds (2nd run 984 milliseconds)

Dirs = 793, files = 10748, max depth = 10
Time taken = 1382 milliseconds (2nd run 735 milliseconds)

Dirs = 1052, files = 5903, max depth = 12
Time taken = 936 milliseconds (2nd run 595 milliseconds)

Note: the System.Diagnostics.Stopwatch class was used for timing.
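A minimal sketch of how a first and a second pass over the same tree can be timed with System.Diagnostics.Stopwatch. It is not the harness that produced the numbers above: ScanTimer, TimeMs, CountEntries and Report are hypothetical names, and CountEntries merely stands in for whichever scan is being measured.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ScanTimer
{
    // Hypothetical helper: runs the supplied action once and returns the
    // elapsed wall-clock time in milliseconds.
    public static long TimeMs(Action scan)
    {
        var sw = Stopwatch.StartNew();
        scan();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Stand-in workload: counts every file and directory below root using
    // the framework's own recursive enumeration.
    public static int CountEntries(string root)
    {
        int count = 0;
        var dirInfo = new DirectoryInfo(root);
        foreach (var entry in dirInfo.EnumerateFileSystemInfos("*", SearchOption.AllDirectories))
            count++;
        return count;
    }

    // Times two consecutive passes; the second usually benefits from caching.
    public static void Report(string root)
    {
        long first = TimeMs(() => CountEntries(root));
        long second = TimeMs(() => CountEntries(root));
        Console.WriteLine("first run = {0} ms, second run = {1} ms", first, second);
    }
}
```

Reporting both passes separately makes it easier to tell a cold scan from a cached one when comparing implementations.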


cx6n0qe34#

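I recently (2020) discovered this post because I needed to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.
This solution uses the Win32 API together with .NET's Parallel.ForEach() to leverage the thread pool and maximize performance.
P/Invoke: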

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr FindFirstFile(
    string lpFileName,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindNextFile(
    IntPtr hFindFile,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
    IntPtr hFindFile
    );

Method:

public static Tuple<long, long> CountFilesDirectories(
    string path,
    CancellationToken token
    )
{
    if (String.IsNullOrWhiteSpace(path))
        throw new ArgumentNullException("path", "The provided path is NULL or empty.");

    // If the provided path doesn't end in a backslash, append one.
    if (path.Last() != '\\')
        path += '\\';

    IntPtr hFile = IntPtr.Zero;
    Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();

    long files = 0;
    long dirs = 0;

    try
    {
        hFile = Win32.Kernel32.FindFirstFile(
            path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
            ref fd
            );

        // If we encounter an error, or there are no files/directories, we return no entries.
        if (hFile.ToInt64() == -1)
            return Tuple.Create<long, long>(0, 0);

        //
        // Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
        //

        List<string> directories = new List<string>();

        do
        {
            // If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
            // count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
            if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
                (fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
                fd.cFileName != "." && fd.cFileName != "..")
            {
                directories.Add(System.IO.Path.Combine(path, fd.cFileName));
                dirs++;
            }
            // Otherwise, if this is a file ("archive"), increment the file count.
            else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
            {
                files++;
            }
        }
        while (Win32.Kernel32.FindNextFile(hFile, ref fd));

        // Iterate over each discovered directory in parallel to maximize file/directory counting performance,
        // calling itself recursively to traverse each directory completely.
        Parallel.ForEach(
            directories,
            new ParallelOptions()
            {
                CancellationToken = token
            },
            directory =>
            {
                var count = CountFilesDirectories(
                    directory,
                    token
                    );

                lock (directories)
                {
                    files += count.Item1;
                    dirs += count.Item2;
                }
            });
    }
    catch (Exception)
    {
        // Handle as desired.
    }
    finally
    {
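        // FindFirstFile returns INVALID_HANDLE_VALUE (-1) on failure; only close a real handle.
        if (hFile != IntPtr.Zero && hFile.ToInt64() != -1)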
            Win32.Kernel32.FindClose(hFile);
    }

    return Tuple.Create<long, long>(files, dirs);
}

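On my local system, the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found this to be tremendously faster: 45 minutes with the .NET methods versus 90 seconds with this approach, to access a remote directory of about 40k files, about 40 GB in size.
This can also fairly easily be modified to include other data, such as the total file size of all files counted, or to rapidly recurse through and delete empty directories, starting at the furthest branch.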
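For the total-size variant, the Win32 find data reports each file's size as two 32-bit halves, nFileSizeHigh and nFileSizeLow. A minimal sketch of combining them is shown here; FindDataSize and ToFileSize are hypothetical names, and it assumes the WIN32_FIND_DATA struct (not shown in this answer) exposes the two fields as uint.

```csharp
using System;

public static class FindDataSize
{
    // Combines the two 32-bit halves from WIN32_FIND_DATA into one 64-bit
    // byte count: size = (nFileSizeHigh << 32) | nFileSizeLow.
    // Both halves are taken as unsigned so the low half is never sign-extended.
    public static long ToFileSize(uint nFileSizeHigh, uint nFileSizeLow)
    {
        return ((long)nFileSizeHigh << 32) | nFileSizeLow;
    }
}
```

Inside the counting loop, the result would be added to a running total in the branch that counts files, and the per-directory totals would be merged under the same lock that already protects files and dirs.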


cgfeq70w5#

I would use, or base my own code on, this multi-threaded library: http://www.codeproject.com/KB/files/FileFind.aspx


enyaitl36#

Try this (i.e. do the initialization first, and then reuse your List and your DirectoryInfo objects):

static List<Info> RecursiveMovieFolderScan1(string path) {
      var info = new List<Info>();
      var dirInfo = new DirectoryInfo(path);
      RecursiveMovieFolderScan(dirInfo, info);
      return info;
  } 

  static List<Info> RecursiveMovieFolderScan(DirectoryInfo dirInfo, List<Info> info){

    foreach (var dir in dirInfo.GetDirectories()) {

        info.Add(new Info() {
            IsDirectory = true,
            CreatedDate = dir.CreationTimeUtc,
            ModifiedDate = dir.LastWriteTimeUtc,
            Path = dir.FullName
        });

        RecursiveMovieFolderScan(dir, info);
    }

    foreach (var file in dirInfo.GetFiles()) {
        info.Add(new Info()
        {
            IsDirectory = false,
            CreatedDate = file.CreationTimeUtc,
            ModifiedDate = file.LastWriteTimeUtc,
            Path = file.FullName
        });
    }

    return info; 
}


64jmpszr7#

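I had the same problem recently. I think it is also reasonable to output all folders and files into a text file, then read the text file with a StreamReader and process the entries however you want, using multiple threads.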

cmd.exe /u /c dir "M:\" /s /b >"c:\flist1.txt"

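[Update] Hi Moby, you are correct. My approach is slower because of the overhead of reading back the output text file. I actually took some time to test the top answer and cmd.exe with 2 million files.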

The top answer: 2010100 files, time: 53023
cmd.exe method: 2010100 files, cmd time: 64907, scan output file time: 19832.

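The top answer's method (53023) is faster than cmd.exe (64907), and that is before even counting the time needed to read the output text file. My original point was to offer an answer that isn't too bad, but I still feel sorry, ha.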
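For reference, reading the listing back and processing it on several threads could look roughly like this. It is a sketch under assumptions rather than the code that was timed above: ListFileReader and ProcessListing are hypothetical names, and the file is assumed to be UTF-16 little-endian, which is what cmd.exe writes when started with /u.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class ListFileReader
{
    // Hypothetical helper: streams the paths written by "dir /s /b" and
    // hands each one to the supplied callback on the thread pool.
    // Returns the number of non-empty lines that were processed.
    public static int ProcessListing(string listFile, Action<string> handle)
    {
        int processed = 0;

        // File.ReadLines is lazy, so the whole listing is never held in memory.
        // Encoding.Unicode is UTF-16 little-endian in .NET.
        var lines = File.ReadLines(listFile, Encoding.Unicode);

        Parallel.ForEach(lines, line =>
        {
            if (line.Length == 0)
                return; // skip blank lines

            handle(line); // the callback must be thread-safe
            Interlocked.Increment(ref processed);
        });

        return processed;
    }
}
```

A call such as ListFileReader.ProcessListing(@"c:\flist1.txt", p => { /* per-path work */ }) would then walk the file produced by the command above.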


pu82cl6c8#

This implementation, which needs a bit of tweaking, is 5-10 times faster.

static List<Info> RecursiveScan2(string directory) {
        IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
        WIN32_FIND_DATAW findData;
        IntPtr findHandle = INVALID_HANDLE_VALUE;

        var info = new List<Info>();
        try {
            findHandle = FindFirstFileW(directory + @"\*", out findData);
            if (findHandle != INVALID_HANDLE_VALUE) {

                do {
                    if (findData.cFileName == "." || findData.cFileName == "..") continue;

                    string fullpath = directory + (directory.EndsWith("\\") ? "" : "\\") + findData.cFileName;

                    bool isDir = false;

                    if ((findData.dwFileAttributes & FileAttributes.Directory) != 0) {
                        isDir = true;
                        info.AddRange(RecursiveScan2(fullpath));
                    }

                    info.Add(new Info()
                    {
                        CreatedDate = findData.ftCreationTime.ToDateTime(),
                        ModifiedDate = findData.ftLastWriteTime.ToDateTime(),
                        IsDirectory = isDir,
                        Path = fullpath
                    });
                }
                while (FindNextFile(findHandle, out findData));

            }
        } finally {
            if (findHandle != INVALID_HANDLE_VALUE) FindClose(findHandle);
        }
        return info;
    }

Extension method:

public static class FILETIMEExtensions {
        public static DateTime ToDateTime(this System.Runtime.InteropServices.ComTypes.FILETIME filetime ) {
            long highBits = filetime.dwHighDateTime;
            highBits = highBits << 32;
            // Treat the low half as unsigned so a set high bit is not sign-extended.
            return DateTime.FromFileTimeUtc(highBits | (uint)filetime.dwLowDateTime);
        }
    }

The interop definitions are:

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern IntPtr FindFirstFileW(string lpFileName, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    public static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll")]
    public static extern bool FindClose(IntPtr hFindFile);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    public struct WIN32_FIND_DATAW {
        public FileAttributes dwFileAttributes;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public int nFileSizeHigh;
        public int nFileSizeLow;
        public int dwReserved0;
        public int dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }


nnvyjq4y9#

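There is a long history of the .NET file enumeration methods being slow. The issue is that there is no instantaneous way of enumerating large directory structures. Even the accepted answer here has its issues with GC allocations.
The best I have been able to do is wrapped up in my library and exposed as the FindFile (source) class in the CSharpTest.Net.IO namespace. This class can enumerate files and folders without unneeded GC allocations and string marshaling.
The usage is simple enough, and the RaiseOnAccessDenied property will skip the directories and files the user does not have access to: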

private static long SizeOf(string directory)
    {
        var fcounter = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
        fcounter.RaiseOnAccessDenied = false;

        long size = 0, total = 0;
        fcounter.FileFound +=
            (o, e) =>
            {
                if (!e.IsDirectory)
                {
                    Interlocked.Increment(ref total);
                    size += e.Length;
                }
            };

        Stopwatch sw = Stopwatch.StartNew();
        fcounter.Find();
        Console.WriteLine("Enumerated {0:n0} files totaling {1:n0} bytes in {2:n3} seconds.",
                          total, size, sw.Elapsed.TotalSeconds);
        return size;
    }

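For my local C:\ drive this outputs the following:
Enumerated 810,046 files totaling 307,707,792,662 bytes in 232.876 seconds.
Your mileage may vary with drive speed, but this is the fastest method I have found of enumerating files in managed code. The event parameter is a mutable class of type FindFile.FileFoundEventArgs, so be sure you do not keep a reference to it, as its values will change with each event raised.
You might also note that the DateTime values exposed are only in UTC. The reason is that the conversion to local time is semi-expensive. You might consider using UTC times to improve performance rather than converting them to local time.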
