fc fs筆記
Forwarded from farseerfc 😂
Although fix_problem does interactively ask the user before every clear_inode
Forwarded from farseerfc 😂
But I suspect users generally just hit Enter without thinking…
Forwarded from farseerfc 😂
So without checksums it can't rule the damage out as bitrot and just give up on it, yet it still has to struggle to maintain some kind of consistency
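(A quick illustration of that interaction, not from the original messages: e2fsck's prompts default to "yes", so blindly hitting Enter amounts to the same thing as passing -y; the device name below is a placeholder.)

    # interactive run: each problem gets a "Fix<y>?"-style prompt, Enter takes the default
    e2fsck -f /dev/sdXN
    # -y assumes "yes" to every question, i.e. what mashing Enter boils down to
    e2fsck -f -y /dev/sdXN
    # -n assumes "no" and opens the filesystem read-only, handy for surveying the damage first
    e2fsck -f -n /dev/sdXN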
Forwarded from farseerfc 😂
Probably, yes; I don't really know the details either… But most of the ZOL problems people complain about online are zfs limiting its own arc too aggressively while plenty of memory is still free… Also, the arc not serving as the page cache is the same on Linux, BSD and Solaris, so anything mmap'ed ends up cached twice, once in the page cache and once in the arc (
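(A hedged sketch of how that arc ceiling is usually inspected and raised on ZFS-on-Linux; the 8 GiB figure is an arbitrary example, not a recommendation.)

    # current arc size and ceiling, from the ZoL kstat interface
    awk '/^(size|c_max) /{print $1, $3}' /proc/spl/kstat/zfs/arcstats
    # raise the ceiling at runtime (value in bytes)
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
    # or persist it across reboots
    echo 'options zfs zfs_arc_max=8589934592' > /etc/modprobe.d/zfs.conf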
Forwarded from farseerfc 😂
On Linux, when memory is low, the mm subsystem itself has enough information, and the responsibility, to decide whether to drop page cache or swap pages out, without going through the vfs, which lets the mm subsystem react very quickly.
Forwarded from farseerfc 😂
On traditional Unix, FreeBSD in particular, and on Windows, the mm's dirty pages follow a layered model backed by the vfs layer, so they have to swap dirty pages out as early as possible to clear the dirty flag before the mm layer can drop those pages when it needs to; as a result both FreeBSD and Windows need a sufficiently large swap to keep performance up.
Forwarded from farseerfc 😂
The reclaim mechanism is very different from the page cache's, though. Linux's mm layer won't call into the vfs when memory is low, so zfs has to hack around that… This was a decision Linus made back when he built the unified page cache: Linus didn't trust filesystems to get the locking right (many filesystems were still using the BKL at the time), and getting it wrong easily causes deadlocks under heavy load, so Linus felt the mm layer should make that decision on its own.
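(A small observation sketch of the consequence, assuming ZFS-on-Linux paths: because the arc lives outside the page cache, free(1) counts it as "used" rather than "buff/cache", and mmap'ed ZFS file data shows up in both of the numbers below.)

    free -m                                   # arc memory is hidden inside "used"
    grep '^Cached:' /proc/meminfo             # page cache only, including mmap'ed pages
    awk '/^size /{printf "ARC: %.0f MiB\n", $3/1048576}' /proc/spl/kstat/zfs/arcstats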
fc fs筆記
https://lore.kernel.org/linux-btrfs/cd123d15-cf94-69fd-5550-c18fd3bdaf5a@gmail.com/T/#mffbb5ea3769850e942ac298e2e342e55ac7e286b
How to use btrfs raid5 successfully(ish)
From: Zygo Blaxell @ 2020-06-27 3:24 UTC

To: linux-btrfs

Here are some guidelines for users running btrfs raid5 arrays to survive single-disk failures without losing all the data. Tested with kernel 5.4.41.

This list is intended for users. The developer version of this list (with references to detailed bug descriptions) is https://lore.kernel.org/linux-btrfs/20200627030614.GW10769@hungrycats.org/

Most of this advice applies to raid6 as well. btrfs raid5 is in such rough shape that I'm not bothering to test raid6 yet.

- never use raid5 for metadata.
Use raid1 for metadata (raid1c3 for raid6). raid5 metadata is vulnerable to multiple known bugs that can each prevent successful recovery from disk failure or cause unrecoverable filesystem damage.
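For example (device names and mount point are placeholders, not from the original post), the metadata profile can be set at mkfs time or converted later with balance:

    # new filesystem: raid5 data, raid1 metadata
    mkfs.btrfs -d raid5 -m raid1 /dev/sda /dev/sdb /dev/sdc
    # existing filesystem: convert metadata and system chunks (system conversion requires -f)
    btrfs balance start -mconvert=raid1 -sconvert=raid1 -f /mnt/array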

- run scrubs often.
Scrub can repair corrupted data before it is permanently lost. Ordinary read and write operations on btrfs raid5 are not able to repair disk corruption in some cases.

- run scrubs on one disk at a time.
btrfs scrub is designed for mirrored and striped arrays. 'btrfs scrub' runs one kernel thread per disk, and that thread reads (and, when errors are detected and repair is possible, writes) to a single disk independently of all other disks. When 'btrfs scrub' is used for a raid5 array, it still runs a thread for each disk, but each thread reads data blocks from all disks in order to compute parity. This is a performance disaster, as every disk is read and written competitively by each thread.

To avoid these problems, run 'btrfs scrub start -B /dev/xxx' sequentially for each disk in the btrfs array, instead of 'btrfs scrub start /mountpoint/filesystem'. This will run much faster.
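A minimal sketch of that per-device loop (member device names are placeholders):

    # scrub one member at a time; -B keeps each scrub in the foreground until it finishes
    for dev in /dev/sda /dev/sdb /dev/sdc; do
        btrfs scrub start -B "$dev"
    done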

- ignore spurious IO errors on reads while the filesystem is
degraded.
Due to a bug, the filesystem will report random spurious IO errors and csum failures on reads in raid5 degraded mode where no errors exist on disk. This affects normal read operations, btrfs balance, and device remove, but not 'btrfs replace'. Such errors should be ignored until 'btrfs replace' completes.

This bug does not appear to affect writes, but it will make some data that was recently written unreadable until the array exits degraded mode.

- device remove and balance will not be usable in degraded mode.
'device remove' and balance won't harm anything in degraded mode, but they will abort frequently due to the random spurious IO errors.

- when a disk fails, use 'btrfs replace' to replace it.
'btrfs replace' is currently the only reliable way to get a btrfs raid5 out of degraded mode.

If you plan to use spare drives, do not add them to the filesystem before a disk failure. You may not be able to redistribute data from missing disks over existing disks with device remove. Keep spare disks empty and activate them using 'btrfs replace' as active disks fail.
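A hedged sketch of that workflow, assuming the failed disk had devid 2 and /dev/sdd is the empty spare (all names here are illustrative):

    # mount the array degraded after the failure
    mount -o degraded /dev/sda /mnt/array
    # rebuild onto the spare; a missing device can be referred to by its devid
    btrfs replace start -B 2 /dev/sdd /mnt/array
    # progress check; replace resumes on its own after a reboot and remount
    btrfs replace status /mnt/array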

- plan for the filesystem to be unusable during recovery.
There is currently no solution for reliable operation of applications using a filesystem with raid5 data during a disk failure. Data storage works to the extent I have been able to test it, but data retrieval is unreliable due to the spurious read error bug.

Shut down any applications using the filesystem at the time of disk failure, and keep them down until the failed disk is fully replaced.

- be prepared to reboot multiple times during disk replacement.
'btrfs replace' has some minor bugs that don't impact data, but do force kernel reboots due to hangs and stuck status flags. Replace will restart automatically after a reboot when the filesystem is mounted again.

- spurious IO errors and csum failures will disappear when
the filesystem is no longer in degraded mode, leaving only
real IO errors and csum failures.
Any read errors after btrfs replace is done (and maybe after an extra reboot to be sure replace is really done) are real data loss. Sorry.
- btrfs raid5 does not provide as complete protection against
on-disk data corruption as btrfs raid1 does.
When data corruption is present on disks (e.g. when a disk is temporarily disconnected and then reconnected), bugs in btrfs raid5 read and write code may fail to repair the corruption, resulting in permanent data loss.

btrfs raid5 is quantitatively more robust against data corruption than ext4+mdadm (which cannot self-repair corruption at all), but not as reliable as btrfs raid1 (which can self-repair all single-disk corruptions detectable by csum check).

- scrub and dev stats report data corruption on wrong devices
in raid5.
When there are csum failures, error counters of a random disk will be incremented, not necessarily the disk that contains the corrupted blocks. This makes it difficult or impossible to identify which disk in a raid5 array is corrupting data.
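For reference, the counters in question are the ones printed by 'btrfs device stats' (mount point is a placeholder); on raid5 the corruption_errs attribution should be read with the above caveat in mind:

    btrfs device stats /mnt/array      # read/write/flush/corruption/generation errors per device
    btrfs device stats -z /mnt/array   # same, then reset the counters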

- scrub sometimes counts a csum error as a read error instead
on raid5.
Read and write errors are counted against the correct disk; however, there is some overlap in the read counter, which is a combination of true csum errors and false read failures.

- errors during readahead operations are repaired without
incrementing dev stats, discarding critical failure information.
This is not just a raid5 bug, it affects all btrfs profiles.

- what about write hole?
There is a write hole issue on btrfs raid5, but it occurs much less often than the other known issues, and the other issues affect much more data per failure event.