blob: b5cb495c0bf3c8fca0063d58f2ec909356b2e470 [file] [log] [blame] [view]
Nikolay Vitkov95ea0452025-04-30 15:52:341# path-scurry
2
3Extremely high performant utility for building tools that read
4the file system, minimizing filesystem and path string munging
5operations to the greatest degree possible.
6
7## Ugh, yet another file traversal thing on npm?
8
9Yes. None of the existing ones gave me exactly what I wanted.
10
11## Well what is it you wanted?
12
13While working on [glob](http://npm.im/glob), I found that I
14needed a module to very efficiently manage the traversal over a
15folder tree, such that:
16
171. No `readdir()` or `stat()` would ever be called on the same
18 file or directory more than one time.
192. No `readdir()` calls would be made if we can be reasonably
20 sure that the path is not a directory. (Ie, a previous
21 `readdir()` or `stat()` covered the path, and
22 `ent.isDirectory()` is false.)
233. `path.resolve()`, `dirname()`, `basename()`, and other
24 string-parsing/munging operations are be minimized. This means
25 it has to track "provisional" child nodes that may not exist
26 (and if we find that they _don't_ exist, store that
27 information as well, so we don't have to ever check again).
284. The API is not limited to use as a stream/iterator/etc. There
29 are many cases where an API like node's `fs` is preferrable.
305. It's more important to prevent excess syscalls than to be up
31 to date, but it should be smart enough to know what it
32 _doesn't_ know, and go get it seamlessly when requested.
336. Do not blow up the JS heap allocation if operating on a
34 directory with a huge number of entries.
357. Handle all the weird aspects of Windows paths, like UNC paths
36 and drive letters and wrongway slashes, so that the consumer
37 can return canonical platform-specific paths without having to
38 parse or join or do any error-prone string munging.
39
40## PERFORMANCE
41
42JavaScript people throw around the word "blazing" a lot. I hope
43that this module doesn't blaze anyone. But it does go very fast,
44in the cases it's optimized for, if used properly.
45
46PathScurry provides ample opportunities to get extremely good
47performance, as well as several options to trade performance for
48convenience.
49
50Benchmarks can be run by executing `npm run bench`.
51
52As is always the case, doing more means going slower, doing less
53means going faster, and there are trade offs between speed and
54memory usage.
55
56PathScurry makes heavy use of [LRUCache](http://npm.im/lru-cache)
57to efficiently cache whatever it can, and `Path` objects remain
58in the graph for the lifetime of the walker, so repeated calls
59with a single PathScurry object will be extremely fast. However,
60adding items to a cold cache means "doing more", so in those
61cases, we pay a price. Nothing is free, but every effort has been
62made to reduce costs wherever possible.
63
64Also, note that a "cache as long as possible" approach means that
65changes to the filesystem may not be reflected in the results of
66repeated PathScurry operations.
67
68For resolving string paths, `PathScurry` ranges from 5-50 times
69faster than `path.resolve` on repeated resolutions, but around
70100 to 1000 times _slower_ on the first resolution. If your
71program is spending a lot of time resolving the _same_ paths
72repeatedly (like, thousands or millions of times), then this can
73be beneficial. But both implementations are pretty fast, and
74speeding up an infrequent operation from 4µs to 400ns is not
75going to move the needle on your app's performance.
76
77For walking file system directory trees, a lot depends on how
78often a given PathScurry object will be used, and also on the
79walk method used.
80
81With default settings on a folder tree of 100,000 items,
82consisting of around a 10-to-1 ratio of normal files to
83directories, PathScurry performs comparably to
84[@nodelib/fs.walk](http://npm.im/@nodelib/fs.walk), which is the
85fastest and most reliable file system walker I could find. As far
86as I can tell, it's almost impossible to go much faster in a
87Node.js program, just based on how fast you can push syscalls out
88to the fs thread pool.
89
90On my machine, that is about 1000-1200 completed walks per second
91for async or stream walks, and around 500-600 walks per second
92synchronously.
93
94In the warm cache state, PathScurry's performance increases
95around 4x for async `for await` iteration, 10-15x faster for
96streams and synchronous `for of` iteration, and anywhere from 30x
97to 80x faster for the rest.
98
99```
100# walk 100,000 fs entries, 10/1 file/dir ratio
101# operations / ms
102 New PathScurry object | Reuse PathScurry object
103 stream: 1112.589 | 13974.917
104sync stream: 492.718 | 15028.343
105 async walk: 1095.648 | 32706.395
106 sync walk: 527.632 | 46129.772
107 async iter: 1288.821 | 5045.510
108 sync iter: 498.496 | 17920.746
109```
110
111A hand-rolled walk calling `entry.readdir()` and recursing
112through the entries can benefit even more from caching, with
113greater flexibility and without the overhead of streams or
114generators.
115
116The cold cache state is still limited by the costs of file system
117operations, but with a warm cache, the only bottleneck is CPU
118speed and VM optimizations. Of course, in that case, some care
119must be taken to ensure that you don't lose performance as a
120result of silly mistakes, like calling `readdir()` on entries
121that you know are not directories.
122
123```
124# manual recursive iteration functions
125 cold cache | warm cache
126async: 1164.901 | 17923.320
127 cb: 1101.127 | 40999.344
128zalgo: 1082.240 | 66689.936
129 sync: 526.935 | 87097.591
130```
131
132In this case, the speed improves by around 10-20x in the async
133case, 40x in the case of using `entry.readdirCB` with protections
134against synchronous callbacks, and 50-100x with callback
135deferrals disabled, and _several hundred times faster_ for
136synchronous iteration.
137
138If you can think of a case that is not covered in these
139benchmarks, or an implementation that performs significantly
140better than PathScurry, please [let me
141know](https://github.com/isaacs/path-scurry/issues).
142
143## USAGE
144
145```ts
146// hybrid module, load with either method
147import { PathScurry, Path } from 'path-scurry'
148// or:
149const { PathScurry, Path } = require('path-scurry')
150
151// very simple example, say we want to find and
152// delete all the .DS_Store files in a given path
153// note that the API is very similar to just a
154// naive walk with fs.readdir()
155import { unlink } from 'fs/promises'
156
157// easy way, iterate over the directory and do the thing
158const pw = new PathScurry(process.cwd())
159for await (const entry of pw) {
160 if (entry.isFile() && entry.name === '.DS_Store') {
161 unlink(entry.fullpath())
162 }
163}
164
165// here it is as a manual recursive method
166const walk = async (entry: Path) => {
167 const promises: Promise<any> = []
168 // readdir doesn't throw on non-directories, it just doesn't
169 // return any entries, to save stack trace costs.
170 // Items are returned in arbitrary unsorted order
171 for (const child of await pw.readdir(entry)) {
172 // each child is a Path object
173 if (child.name === '.DS_Store' && child.isFile()) {
174 // could also do pw.resolve(entry, child.name),
175 // just like fs.readdir walking, but .fullpath is
176 // a *slightly* more efficient shorthand.
177 promises.push(unlink(child.fullpath()))
178 } else if (child.isDirectory()) {
179 promises.push(walk(child))
180 }
181 }
182 return Promise.all(promises)
183}
184
185walk(pw.cwd).then(() => {
186 console.log('all .DS_Store files removed')
187})
188
189const pw2 = new PathScurry('/a/b/c') // pw2.cwd is the Path for /a/b/c
190const relativeDir = pw2.cwd.resolve('../x') // Path entry for '/a/b/x'
191const relative2 = pw2.cwd.resolve('/a/b/d/../x') // same path, same entry
192assert.equal(relativeDir, relative2)
193```
194
195## API
196
197[Full TypeDoc API](https://isaacs.github.io/path-scurry)
198
199There are platform-specific classes exported, but for the most
200part, the default `PathScurry` and `Path` exports are what you
201most likely need, unless you are testing behavior for other
202platforms.
203
204Intended public API is documented here, but the full
205documentation does include internal types, which should not be
206accessed directly.
207
208### Interface `PathScurryOpts`
209
210The type of the `options` argument passed to the `PathScurry`
211constructor.
212
213- `nocase`: Boolean indicating that file names should be compared
214 case-insensitively. Defaults to `true` on darwin and win32
215 implementations, `false` elsewhere.
216
217 **Warning** Performing case-insensitive matching on a
218 case-sensitive filesystem will result in occasionally very
219 bizarre behavior. Performing case-sensitive matching on a
220 case-insensitive filesystem may negatively impact performance.
221
222- `childrenCacheSize`: Number of child entries to cache, in order
223 to speed up `resolve()` and `readdir()` calls. Defaults to
224 `16 * 1024` (ie, `16384`).
225
226 Setting it to a higher value will run the risk of JS heap
227 allocation errors on large directory trees. Setting it to `256`
228 or smaller will significantly reduce the construction time and
229 data consumption overhead, but with the downside of operations
230 being slower on large directory trees. Setting it to `0` will
231 mean that effectively no operations are cached, and this module
232 will be roughly the same speed as `fs` for file system
233 operations, and _much_ slower than `path.resolve()` for
234 repeated path resolution.
235
236- `fs` An object that will be used to override the default `fs`
237 methods. Any methods that are not overridden will use Node's
238 built-in implementations.
239
240 - lstatSync
241 - readdir (callback `withFileTypes` Dirent variant, used for
242 readdirCB and most walks)
243 - readdirSync
244 - readlinkSync
245 - realpathSync
246 - promises: Object containing the following async methods:
247 - lstat
248 - readdir (Dirent variant only)
249 - readlink
250 - realpath
251
252### Interface `WalkOptions`
253
254The options object that may be passed to all walk methods.
255
256- `withFileTypes`: Boolean, default true. Indicates that `Path`
257 objects should be returned. Set to `false` to get string paths
258 instead.
259- `follow`: Boolean, default false. Attempt to read directory
260 entries from symbolic links. Otherwise, only actual directories
261 are traversed. Regardless of this setting, a given target path
262 will only ever be walked once, meaning that a symbolic link to
263 a previously traversed directory will never be followed.
264
265 Setting this imposes a slight performance penalty, because
266 `readlink` must be called on all symbolic links encountered, in
267 order to avoid infinite cycles.
268
269- `filter`: Function `(entry: Path) => boolean`. If provided,
270 will prevent the inclusion of any entry for which it returns a
271 falsey value. This will not prevent directories from being
272 traversed if they do not pass the filter, though it will
273 prevent the directories themselves from being included in the
274 results. By default, if no filter is provided, then all entries
275 are included in the results.
276- `walkFilter`: Function `(entry: Path) => boolean`. If provided,
277 will prevent the traversal of any directory (or in the case of
278 `follow:true` symbolic links to directories) for which the
279 function returns false. This will not prevent the directories
280 themselves from being included in the result set. Use `filter`
281 for that.
282
283Note that TypeScript return types will only be inferred properly
284from static analysis if the `withFileTypes` option is omitted, or
285a constant `true` or `false` value.
286
287### Class `PathScurry`
288
289The main interface. Defaults to an appropriate class based on the
290current platform.
291
292Use `PathScurryWin32`, `PathScurryDarwin`, or `PathScurryPosix`
293if implementation-specific behavior is desired.
294
295All walk methods may be called with a `WalkOptions` argument to
296walk over the object's current working directory with the
297supplied options.
298
299#### `async pw.walk(entry?: string | Path | WalkOptions, opts?: WalkOptions)`
300
301Walk the directory tree according to the options provided,
302resolving to an array of all entries found.
303
304#### `pw.walkSync(entry?: string | Path | WalkOptions, opts?: WalkOptions)`
305
306Walk the directory tree according to the options provided,
307returning an array of all entries found.
308
309#### `pw.iterate(entry?: string | Path | WalkOptions, opts?: WalkOptions)`
310
311Iterate over the directory asynchronously, for use with `for
312await of`. This is also the default async iterator method.
313
314#### `pw.iterateSync(entry?: string | Path | WalkOptions, opts?: WalkOptions)`
315
316Iterate over the directory synchronously, for use with `for of`.
317This is also the default sync iterator method.
318
319#### `pw.stream(entry?: string | Path | WalkOptions, opts?: WalkOptions)`
320
321Return a [Minipass](http://npm.im/minipass) stream that emits
322each entry or path string in the walk. Results are made available
323asynchronously.
324
325#### `pw.streamSync(entry?: string | Path | WalkOptions, opts?: WalkOptions)`
326
327Return a [Minipass](http://npm.im/minipass) stream that emits
328each entry or path string in the walk. Results are made available
329synchronously, meaning that the walk will complete in a single
330tick if the stream is fully consumed.
331
332#### `pw.cwd`
333
334Path object representing the current working directory for the
335PathScurry.
336
337#### `pw.chdir(path: string)`
338
339Set the new effective current working directory for the scurry
340object, so that `path.relative()` and `path.relativePosix()`
341return values relative to the new cwd path.
342
343#### `pw.depth(path?: Path | string): number`
344
345Return the depth of the specified path (or the PathScurry cwd)
346within the directory tree.
347
348Root entries have a depth of `0`.
349
350#### `pw.resolve(...paths: string[])`
351
352Caching `path.resolve()`.
353
354Significantly faster than `path.resolve()` if called repeatedly
355with the same paths. Significantly slower otherwise, as it builds
356out the cached Path entries.
357
358To get a `Path` object resolved from the `PathScurry`, use
359`pw.cwd.resolve(path)`. Note that `Path.resolve` only takes a
360single string argument, not multiple.
361
362#### `pw.resolvePosix(...paths: string[])`
363
364Caching `path.resolve()`, but always using posix style paths.
365
366This is identical to `pw.resolve(...paths)` on posix systems (ie,
367everywhere except Windows).
368
369On Windows, it returns the full absolute UNC path using `/`
370separators. Ie, instead of `'C:\\foo\\bar`, it would return
371`//?/C:/foo/bar`.
372
373#### `pw.relative(path: string | Path): string`
374
375Return the relative path from the PathWalker cwd to the supplied
376path string or entry.
377
378If the nearest common ancestor is the root, then an absolute path
379is returned.
380
381#### `pw.relativePosix(path: string | Path): string`
382
383Return the relative path from the PathWalker cwd to the supplied
384path string or entry, using `/` path separators.
385
386If the nearest common ancestor is the root, then an absolute path
387is returned.
388
389On posix platforms (ie, all platforms except Windows), this is
390identical to `pw.relative(path)`.
391
392On Windows systems, it returns the resulting string as a
393`/`-delimited path. If an absolute path is returned (because the
394target does not share a common ancestor with `pw.cwd`), then a
395full absolute UNC path will be returned. Ie, instead of
396`'C:\\foo\\bar`, it would return `//?/C:/foo/bar`.
397
398#### `pw.basename(path: string | Path): string`
399
400Return the basename of the provided string or Path.
401
402#### `pw.dirname(path: string | Path): string`
403
404Return the parent directory of the supplied string or Path.
405
406#### `async pw.readdir(dir = pw.cwd, opts = { withFileTypes: true })`
407
408Read the directory and resolve to an array of strings if
409`withFileTypes` is explicitly set to `false` or Path objects
410otherwise.
411
412Can be called as `pw.readdir({ withFileTypes: boolean })` as
413well.
414
415Returns `[]` if no entries are found, or if any error occurs.
416
417Note that TypeScript return types will only be inferred properly
418from static analysis if the `withFileTypes` option is omitted, or
419a constant `true` or `false` value.
420
421#### `pw.readdirSync(dir = pw.cwd, opts = { withFileTypes: true })`
422
423Synchronous `pw.readdir()`
424
425#### `async pw.readlink(link = pw.cwd, opts = { withFileTypes: false })`
426
427Call `fs.readlink` on the supplied string or Path object, and
428return the result.
429
430Can be called as `pw.readlink({ withFileTypes: boolean })` as
431well.
432
433Returns `undefined` if any error occurs (for example, if the
434argument is not a symbolic link), or a `Path` object if
435`withFileTypes` is explicitly set to `true`, or a string
436otherwise.
437
438Note that TypeScript return types will only be inferred properly
439from static analysis if the `withFileTypes` option is omitted, or
440a constant `true` or `false` value.
441
442#### `pw.readlinkSync(link = pw.cwd, opts = { withFileTypes: false })`
443
444Synchronous `pw.readlink()`
445
446#### `async pw.lstat(entry = pw.cwd)`
447
448Call `fs.lstat` on the supplied string or Path object, and fill
449in as much information as possible, returning the updated `Path`
450object.
451
452Returns `undefined` if the entry does not exist, or if any error
453is encountered.
454
455Note that some `Stats` data (such as `ino`, `dev`, and `mode`)
456will not be supplied. For those things, you'll need to call
457`fs.lstat` yourself.
458
459#### `pw.lstatSync(entry = pw.cwd)`
460
461Synchronous `pw.lstat()`
462
463#### `pw.realpath(entry = pw.cwd, opts = { withFileTypes: false })`
464
465Call `fs.realpath` on the supplied string or Path object, and
466return the realpath if available.
467
468Returns `undefined` if any error occurs.
469
470May be called as `pw.realpath({ withFileTypes: boolean })` to run
471on `pw.cwd`.
472
473#### `pw.realpathSync(entry = pw.cwd, opts = { withFileTypes: false })`
474
475Synchronous `pw.realpath()`
476
477### Class `Path` implements [fs.Dirent](https://nodejs.org/docs/latest/api/fs.html#class-fsdirent)
478
479Object representing a given path on the filesystem, which may or
480may not exist.
481
482Note that the actual class in use will be either `PathWin32` or
483`PathPosix`, depending on the implementation of `PathScurry` in
484use. They differ in the separators used to split and join path
485strings, and the handling of root paths.
486
487In `PathPosix` implementations, paths are split and joined using
488the `'/'` character, and `'/'` is the only root path ever in use.
489
490In `PathWin32` implementations, paths are split using either
491`'/'` or `'\\'` and joined using `'\\'`, and multiple roots may
492be in use based on the drives and UNC paths encountered. UNC
493paths such as `//?/C:/` that identify a drive letter, will be
494treated as an alias for the same root entry as their associated
495drive letter (in this case `'C:\\'`).
496
497#### `path.name`
498
499Name of this file system entry.
500
501**Important**: _always_ test the path name against any test
502string using the `isNamed` method, and not by directly comparing
503this string. Otherwise, unicode path strings that the system sees
504as identical will not be properly treated as the same path,
505leading to incorrect behavior and possible security issues.
506
507#### `path.isNamed(name: string): boolean`
508
509Return true if the path is a match for the given path name. This
510handles case sensitivity and unicode normalization.
511
512Note: even on case-sensitive systems, it is **not** safe to test
513the equality of the `.name` property to determine whether a given
514pathname matches, due to unicode normalization mismatches.
515
516Always use this method instead of testing the `path.name`
517property directly.
518
519#### `path.isCWD`
520
521Set to true if this `Path` object is the current working
522directory of the `PathScurry` collection that contains it.
523
524#### `path.getType()`
525
526Returns the type of the Path object, `'File'`, `'Directory'`,
527etc.
528
529#### `path.isType(t: type)`
530
531Returns true if `is{t}()` returns true.
532
533For example, `path.isType('Directory')` is equivalent to
534`path.isDirectory()`.
535
536#### `path.depth()`
537
538Return the depth of the Path entry within the directory tree.
539Root paths have a depth of `0`.
540
541#### `path.fullpath()`
542
543The fully resolved path to the entry.
544
545#### `path.fullpathPosix()`
546
547The fully resolved path to the entry, using `/` separators.
548
549On posix systems, this is identical to `path.fullpath()`. On
550windows, this will return a fully resolved absolute UNC path
551using `/` separators. Eg, instead of `'C:\\foo\\bar'`, it will
552return `'//?/C:/foo/bar'`.
553
554#### `path.isFile()`, `path.isDirectory()`, etc.
555
556Same as the identical `fs.Dirent.isX()` methods.
557
558#### `path.isUnknown()`
559
560Returns true if the path's type is unknown. Always returns true
561when the path is known to not exist.
562
563#### `path.resolve(p: string)`
564
565Return a `Path` object associated with the provided path string
566as resolved from the current Path object.
567
568#### `path.relative(): string`
569
570Return the relative path from the PathWalker cwd to the supplied
571path string or entry.
572
573If the nearest common ancestor is the root, then an absolute path
574is returned.
575
576#### `path.relativePosix(): string`
577
578Return the relative path from the PathWalker cwd to the supplied
579path string or entry, using `/` path separators.
580
581If the nearest common ancestor is the root, then an absolute path
582is returned.
583
584On posix platforms (ie, all platforms except Windows), this is
585identical to `pw.relative(path)`.
586
587On Windows systems, it returns the resulting string as a
588`/`-delimited path. If an absolute path is returned (because the
589target does not share a common ancestor with `pw.cwd`), then a
590full absolute UNC path will be returned. Ie, instead of
591`'C:\\foo\\bar`, it would return `//?/C:/foo/bar`.
592
593#### `async path.readdir()`
594
595Return an array of `Path` objects found by reading the associated
596path entry.
597
598If path is not a directory, or if any error occurs, returns `[]`,
599and marks all children as provisional and non-existent.
600
601#### `path.readdirSync()`
602
603Synchronous `path.readdir()`
604
605#### `async path.readlink()`
606
607Return the `Path` object referenced by the `path` as a symbolic
608link.
609
610If the `path` is not a symbolic link, or any error occurs,
611returns `undefined`.
612
613#### `path.readlinkSync()`
614
615Synchronous `path.readlink()`
616
617#### `async path.lstat()`
618
619Call `lstat` on the path object, and fill it in with details
620determined.
621
622If path does not exist, or any other error occurs, returns
623`undefined`, and marks the path as "unknown" type.
624
625#### `path.lstatSync()`
626
627Synchronous `path.lstat()`
628
629#### `async path.realpath()`
630
631Call `realpath` on the path, and return a Path object
632corresponding to the result, or `undefined` if any error occurs.
633
634#### `path.realpathSync()`
635
636Synchornous `path.realpath()`