path: root/sys/dev/md
Age | Commit message | Author
8 days | vm_object: remove the charge member | Konstantin Belousov
State that the object charge is zero if object->cred == NULL, or equal to the ptoa(object->size) otherwise. Besides being much simpler, the transition to use object->size corrects the architectural issue with the use of object->charge. The split operations effectively carve holes in the charged regions, but a single counter cannot properly express this. As a result, coalescing anonymous mappings cannot correctly calculate whether the extended mapping, already backed by the existing object, is already accounted for or not [1]. To properly solve the issue, we either need to start tracking exact charged regions in the anonymous objects, which has significant overhead and complications, or accept the slight over-accounting and charge the whole object unconditionally, as is done in the patch. Reported by: mmel, pho [1] Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D54572
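The rule stated in this message can be sketched in userspace C. This is only an illustration under assumptions: `struct sketch_object`, `sketch_ptoa()` and the 4 KiB page shift are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the two vm_object fields the rule mentions. */
struct sketch_object {
	const void *cred;	/* NULL when the object is not charged */
	uint64_t size;		/* object size in pages */
};

/* Pages to bytes, playing the role of ptoa(). */
static uint64_t
sketch_ptoa(uint64_t pages)
{
	return (pages << SKETCH_PAGE_SHIFT);
}

/* The charge is derived from cred and size; no separate counter is kept. */
static uint64_t
sketch_object_charge(const struct sketch_object *obj)
{
	assert(obj != NULL);
	return (obj->cred == NULL ? 0 : sketch_ptoa(obj->size));
}
```

Because the charge is computed rather than stored, splitting or coalescing cannot leave a counter out of step with the object size; the cost is that the whole object is always charged.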
2025-10-28 | sys/dev/md: cleanup includes | Konstantin Belousov
Remove twice included but unneeded explicit sys/param.h. Sort. Sponsored by: The FreeBSD Foundation MFC after: 3 days
2025-07-22 | biosboot: Detect memory disks from PXE | Richard Russo
Walk through the disk driver entries chained off of INT13. MEMDISK is part of the Syslinux project; it loads disk images into memory, sets an int 13h hook and then does a BIOS boot from the image; this can be used as part of a PXE boot environment to load installer disks, however the disks are not accessible from inside the FreeBSD kernel because it doesn't access disks through BIOS APIs. This patch detects the disk images in the loader, and passes their address and length as a driver hint. When the md driver sees the hint, it maps the image, and presents it to the system. (rebased and reworked from https://reviews.freebsd.org/D27349) Feedback from: kib, bapt, olce Differential Revision: https://reviews.freebsd.org/D45404
2025-07-16 | md(4): Stop symlinking vn.4 to md.4 | Mateusz Piotrowski
We've done the same in the past to the vnconfig.8->mdconfig.8 link in: eb5f4569819 Remove ancient vnconfig symlink Reviewed by: bcr, markj, ziaee Approved by: markj (mentor) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27122
2025-07-03 | md: Restore guards in mddestroy() | Mark Johnston
mddestroy() may be invoked on a partially constructed md device. Restore the guards that handled this prior to commit e91022168101. Reported by: syzbot+a0ff73f664de8757cfaa@syzkaller.appspotmail.com Reported by: syzbot+7b4a4824bf81548283ab@syzkaller.appspotmail.com Reviewed by: kib Fixes: e91022168101 ("md(4): move type-specific data under union") Differential Revision: https://reviews.freebsd.org/D51145
2025-07-02 | md(4): move type-specific data under union | Konstantin Belousov
This way it is clear which type uses which members. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D51127
2025-06-25 | md: Use a larger buffer for the ident string | Mark Johnston
With the old size, the string could easily be truncated, resulting in non-unique identifiers. PR: 287679 Reported by: Phil Krylov <phil@krylov.eu> Reviewed by: kib MFC after: 2 weeks
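A minimal sketch of how a too-small buffer can make distinct identifiers collide. The format string, prefix and buffer sizes here are hypothetical and are not taken from md.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a hypothetical identifier into a caller-supplied buffer.
 * Returns 1 if the whole string fit, 0 if snprintf() had to truncate it.
 */
static int
sketch_make_ident(char *buf, size_t buflen, const char *prefix, unsigned unit)
{
	int n;

	assert(buf != NULL && buflen > 0);
	n = snprintf(buf, buflen, "%s-%u", prefix, unit);
	return (n >= 0 && (size_t)n < buflen);
}
```

With an 8-byte buffer and the prefix "memdisk", units 1 and 2 both come out as the same truncated string, while a 32-byte buffer keeps them distinct. Checking the snprintf() return value is one way to notice the truncation.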
2024-11-21 | md: Fix linking of embedded filesystem images on aarch64 | John Baldwin
embedfs.S needs the right aarch64 features for BTI and/or PAC. Obtained from: CheriBSD Fixes: c2e0d56f5e49 ("arm64: Support BTI checking in most of the kernel") Sponsored by: AFRL, DARPA
2024-10-14 | md(4): always trim the last partial sector | Konstantin Belousov
Do it also for the preloaded disk, in addition to the dynamically configured device. This is needed to avoid GEOM checking alignment and panicking on read of the last sector, e.g. for partition schemes and label tasting. PR: 281978 Reported by: bz Reviewed by: bz, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D47102
2024-06-01 | md: round-trip the MUSTDEALLOC and RESERVE options | Alan Somers
If those options are requested when the device is created, ensure that they will be reported by MDIOCQUERY. MFC after: 2 weeks Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/1270
2024-05-10 | md: Merge two switch statements in mdstart_vnode | John Baldwin
While here, use bp->bio_cmd instead of auio.uio_rw to drive read vs write behavior. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D45155
2024-04-30 | Fix new users of MAXPHYS and hide it from the kernel namespace | Andrew Gallatin
In cd8537910406, kib made maxphys a load-time tunable. This made the #define MAXPHYS in sys/param.h almost entirely obsolete, as it could now be overridden by kern.maxphys at boot time, or by opt_maxphys.h. However, decades of tradition have led to several new, incorrect, uses of MAXPHYS in other parts of the kernel, mostly by seasoned developers. I've corrected those uses here in a mechanical fashion, and verified that it fixes a bug in the md driver that I was experiencing. Since using MAXPHYS is such an easy mistake to make, it is best to hide it from the kernel namespace. So I've moved its definition to _maxphys.h, which is now included in param.h only for userspace. That brings up the fact that lots of userspace programs use MAXPHYS for different reasons, most of them probably wrong. Userspace consumers that really need to know the value of maxphys should probably be changed to use the kern.maxphys sysctl. But that's outside the scope of this change. Reviewed by: imp, jkim, kib, markj Fixes: 30038a8b4efc ("md: Get rid of the pbuf zone") Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D44986
2023-11-26 | sys: Remove ancient SCCS tags. | Warner Losh
Remove ancient SCCS tags from the tree, automated scripting, with two minor fixups to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
2023-08-16 | sys: Remove $FreeBSD$: two-line .h pattern | Warner Losh
Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/
2023-08-08 | md driver compat32: fix structure padding for arm, powerpc | Mike Karels
Because the 32-bit md_ioctl structure contains 64-bit members, arm and powerpc add padding to a multiple of 8. i386 doesn't do this. The md_ioctl32 definition was correct for amd64/i386 without padding, but wrong for arm64 and powerpc64. Make __packed__ conditional on __amd64__, and test for the expected size on non-amd64. Note that mdconfig is used in the ATF test suite. Note, I verified the structure size for powerpc, but was unable to test. MFC after: 1 week Reviewed by: jrtc27 Differential Revision: https://reviews.freebsd.org/D41339 Discussed with: jhibbits
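The padding difference can be illustrated with a small sketch. The structures below are hypothetical and much smaller than the real md_ioctl32; they only show how a packed layout and a naturally aligned layout of the same members can differ in size, and how a compile-time size check catches a mismatch. The attribute syntax assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit ioctl argument with a 64-bit member in the middle. */
struct sketch_ioctl_natural {
	uint32_t version;
	uint64_t mediasize;
	uint32_t options;
};

/* Same members, packed: no padding is inserted anywhere. */
struct sketch_ioctl_packed {
	uint32_t version;
	uint64_t mediasize;
	uint32_t options;
} __attribute__((__packed__));

/* Compile-time check of the size the packed ABI is expected to have. */
_Static_assert(sizeof(struct sketch_ioctl_packed) == 16,
    "packed layout must not contain padding");

/* Bytes of padding the natural layout adds on the build host. */
static size_t
sketch_padding_bytes(void)
{
	return (sizeof(struct sketch_ioctl_natural) -
	    sizeof(struct sketch_ioctl_packed));
}
```

On a host where uint64_t is 8-byte aligned, the natural layout gains padding after `version` and at the tail, which is the kind of mismatch a fixed `__packed__` definition cannot describe for every 32-bit ABI.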
2023-07-28 | Pre-quote macros passed to .incbin to avoid unwanted substitution | Jessica Clarke
Currently for the MFS, firmware and VDSO template assembly files we pass the path to include with .incbin unquoted and use __XSTRING within the assembly file to stringify it. However, __XSTRING doesn't just perform a single level of expansion, it performs the normal full expansion of the macro, and so if the path itself happens to tokenise to something that includes a defined macro in it that will itself be substituted. For example, with #define MACRO 1, a path like /path/containing/MACRO/in/it will expand to /path/containing/1/in/it and then, when stringified, end up as "/path/containing/1/in/it", not the intended string. Normally, macros have names that start or end with underscores and are unlikely to appear in a tokenised path (even if technically they could), but now that we've switched to GNU C as of commit ec41a96daaa6 ("sys: Switch the kernel's C standard from C99 to GNU99.") there are a few new macros defined which don't start or end with underscores: unix, which is always defined to 1, and i386, which is defined to 1 on i386. The former probably doesn't appear in user paths in practice, but the latter has been seen to and is likely quite common in the wild. Fix this by defining the macro pre-quoted instead of using __XSTRING. Note that technically we don't need to do this for vdso_wrap.S today as all the paths passed to it are safe file names with no user-controlled prefix but we should do it anyway for consistency and robustness against future changes. This allows make tinderbox to pass when built with source and object directories inside ~/path-with-unix, which would otherwise expand to ~/path-with-1 and break. PR: 272744 Fixes: ec41a96daaa6 ("sys: Switch the kernel's C standard from C99 to GNU99.")
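The substitution problem can be reproduced in plain C. In this sketch the `SKETCH_*` macros are hypothetical equivalents of a two-level stringify helper, and `MACRO` plays the part of predefined names such as unix or i386; the path is the one used in the message above.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_STRING(x)	#x
/* Two-level form: the argument is fully macro-expanded, then stringified. */
#define SKETCH_XSTRING(x)	SKETCH_STRING(x)

#define MACRO 1		/* stands in for predefined macros such as unix */

/* Unquoted path: its tokens are subject to macro expansion. */
#define SKETCH_PATH_UNQUOTED	/path/containing/MACRO/in/it
/* Pre-quoted path: already a string literal, so nothing is substituted. */
#define SKETCH_PATH_QUOTED	"/path/containing/MACRO/in/it"

/* Path as produced by stringifying the unquoted macro. */
static const char *
sketch_unquoted_path(void)
{
	return (SKETCH_XSTRING(SKETCH_PATH_UNQUOTED));
}

/* Path as produced by the pre-quoted macro. */
static const char *
sketch_quoted_path(void)
{
	return (SKETCH_PATH_QUOTED);
}
```

The stringified unquoted path no longer contains the text MACRO, whereas the pre-quoted macro yields the path exactly as written.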
2023-05-23 | md: Get rid of the pbuf zone | Mark Johnston
The zone is used solely to provide KVA for mapping BIOs so that we can pass mapped buffers to VOP_READ and VOP_WRITE. Currently we preallocate nswbuf/10 bufs for this purpose during boot. The intent was to limit KVA usage on 32-bit systems, but the preallocation means that we in fact consumed more KVA than needed unless one has more than nswbuf/10 (typically 25) vnode-backed MD devices in existence, which I would argue is the uncommon case. Meanwhile, all I/O to an MD is handled by a dedicated thread, so we can instead simply preallocate the KVA region at MD device creation time. Event: BSDCan 2023 Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D40215
2023-01-20 | md.c: another style fix | Konstantin Belousov
Noted by: jkim Sponsored by: The FreeBSD Foundation MFC after: 3 days
2023-01-20 | Handle ERELOOKUP from VOP_FSYNC() in several other places | Konstantin Belousov
We need to repeat the operation if the vnode was relocked. Reported and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38114
2022-03-24 | vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd) | Mateusz Guzik
2022-02-17 | md(4): Add dummy support of the BIO_FLUSH command for malloc and swap backend. | Aleksandr Fedorov
PR: 260200 Reported by: editor@callfortesting.org Reviewed by: vmaffione (mentor), markj Approved by: vmaffione (mentor), markj Differential Revision: https://reviews.freebsd.org/D34260
2022-02-10 | Annotate geom_md with MODULE_VERSION | Kyle Evans
This was missed in 74d6c131cbe2 where other geom modules were annotated with MODULE_VERSION. Again, the problem is the same: we can't detect that geom_md is loaded into the kernel without it. This was noticed in release builds on the cluster; mdconfig attempts to load geom_md because it can't detect it in the kernel, but the cluster config includes md(4) and does not build the kmod. This problem would have been masked on hosts with the kmod built, as the kmod attempts to register the g_md module and fails. With this commit, mdconfig would not even try to load it again. Reported by: re (cperciva) MFC after: 3 days
2021-11-25 | vfs: remove the unused thread argument from NDINIT* | Mateusz Guzik
See b4a58fbf640409a1 ("vfs: remove cn_thread") Bump __FreeBSD_version to 1400043.
2021-09-11 | md: Add MD_MUSTDEALLOC support | Ka Ho Ng
This adds an option to detect if hole-punching is implemented by the underlying file system. If this flag is set, and if the underlying file system does not support hole-punching, md(4) fails BIO_DELETE requests with EOPNOTSUPP. Sponsored by: The FreeBSD Foundation Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D31883
2021-08-31 | md: Clamp to a multiple of the sector size when resizing | Mark Johnston
We do this when creating md(4) devices, in kern_mdattach_locked(), but not when resizing the provider. Apply the same policy when resizing, as many GEOM classes do not expect to deal with providers for which pp->mediasize % pp->sectorsize != 0. Reported by: syzkaller MFC after: 1 week Sponsored by: The FreeBSD Foundation
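The clamping policy amounts to rounding the media size down so that mediasize % sectorsize == 0. A minimal sketch, with a hypothetical helper name that is not the driver's:

```c
#include <assert.h>
#include <stdint.h>

/* Round mediasize down to a whole number of sectors. */
static uint64_t
sketch_clamp_mediasize(uint64_t mediasize, uint32_t sectorsize)
{
	assert(sectorsize != 0);
	return (mediasize - mediasize % sectorsize);
}
```

For example, a 1000-byte size with 512-byte sectors is clamped to 512 bytes, so the trailing partial sector is dropped rather than exposed to consumers.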
2021-08-19 | md: Replace BIO_DELETE emulation with vn_deallocate(9) | Ka Ho Ng
Both zero-filling and/or deallocation can be done with vn_deallocate(9). Sponsored by: The FreeBSD Foundation Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D28899
2021-03-30 | sys/dev/md: Drop unnecessary __GLOBL(mfs_root) | Alex Richardson
LLVM12 complains if you change the symbol binding: error: mfs_root_end changed binding to STB_WEAK [-Werror,-Winline-asm] error: mfs_root changed binding to STB_WEAK [-Werror,-Winline-asm]
2021-01-04 | md: Fix a race in mdstart_swap() | Mark Johnston
Release a grabbed page's busy state only after marking it as referenced. Otherwise there exists a narrow window where the page could be freed before the update. Before r356902 this was not a problem since the object lock was held. Discussed with: kib Sponsored by: The FreeBSD Foundation
2020-12-27 | md: Set bio_completed properly in the face of errors | Mark Johnston
Account for any residual bytes. This is only relevant for vnode-backed md(4) devices. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27738
2020-12-23 | md: Fix a read-after-free in BIO_GETATTR handling | Mark Johnston
g_handleattr_int() consumes the bio if the attribute matches, so when we check bp->bio_cmd bp may have been freed. Move GETATTR handling to a separate function to avoid the problem. We do not need to set bio_completed for such bios, g_handleattr_int() will handle it. Also remove the setting of bio_resid before the devstat_end_transaction_bio() call. All of the md(4) bio handlers set bio_resid already. Reported by: KASAN Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27724
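The hazard described here is the general pattern of a callee taking ownership of a request. The userspace sketch below uses hypothetical types and names (`struct sketch_req`, `sketch_handle_attr()`, the attribute string) rather than GEOM's; it shows attribute handling kept in its own function so that the request is never dereferenced once it has been consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

enum sketch_cmd { SKETCH_READ, SKETCH_GETATTR };

struct sketch_req {
	enum sketch_cmd cmd;
	const char *attr;
};

/* Hypothetical consumer: on a match it completes and frees the request. */
static bool
sketch_handle_attr(struct sketch_req *req, const char *name)
{
	if (strcmp(req->attr, name) != 0)
		return (false);
	free(req);		/* ownership has passed to the handler */
	return (true);
}

/*
 * GETATTR handling in a separate function.  Returns true if the request
 * was consumed; in that case the caller must not touch req again.
 */
static bool
sketch_getattr(struct sketch_req *req)
{
	assert(req != NULL && req->cmd == SKETCH_GETATTR);
	if (sketch_handle_attr(req, "sketch::flag"))
		return (true);	/* req is gone */
	return (false);		/* req is still owned by the caller */
}
```

The caller branches only on the return value, so no field of the request is read after a successful hand-off.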
2020-11-28 | Make MAXPHYS tunable. Bump MAXPHYS to 1M. | Konstantin Belousov
Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (*). Overall, we save a significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to the desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except the place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get a private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav (*) Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225 Notes: svn path=/head/; revision=368124
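The reason for the extra b_pages[] slot can be checked with a little arithmetic: a buffer of maxphys bytes that does not start on a page boundary touches one more page than an aligned one. The sketch below assumes 4 KiB pages and uses a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Number of pages touched by the byte range [addr, addr + len). */
static uint64_t
sketch_pages_spanned(uint64_t addr, uint64_t len)
{
	uint64_t first, last;

	assert(len > 0);
	first = addr / SKETCH_PAGE_SIZE;
	last = (addr + len - 1) / SKETCH_PAGE_SIZE;
	return (last - first + 1);
}
```

With a 1 MiB buffer, an aligned start spans 256 pages while a start 100 bytes into a page spans 257, which matches sizing the array to atop(maxphys) + 1.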
2020-11-12 | Fix a typo in a license comment | Mateusz Piotrowski
Approved by: kaktus (src) Notes: svn path=/head/; revision=367618
2020-10-20 | Use a template assembly file to generate the embedded MFS. | John Baldwin
This uses the .incbin directive to pull in the MFS image contents. Using assembly directly ensures that symbols can be defined with the name and properties (such as .size) desired without having to rename symbols, etc. via a second objcopy invocation. Since it is compiled by the C compiler driver, it also avoids the need for all of the EMBEDFS* make variables. Suggested by: jrtc27 Reviewed by: kib, markj Obtained from: CheriBSD MFC after: 2 weeks Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26781 Notes: svn path=/head/; revision=366897
2020-09-01 | md: clean up empty lines in .c and .h files | Mateusz Guzik
Notes: svn path=/head/; revision=365211
2020-06-28 | Remove some redundant assignments and computations. | Mark Johnston
Reported by: alc Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D25400 Notes: svn path=/head/; revision=362739
2020-06-25 | Call swap_pager_freespace() from vm_object_page_remove(). | Mark Johnston
All vm_object_page_remove() callers, except linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when removing a range of pages from an object. The LinuxKPI case appears to be an unintentional omission that could result in leaked swap blocks, so unconditionally free swap space in vm_object_page_remove() to protect against similar bugs in the future. Reviewed by: alc, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25329 Notes: svn path=/head/; revision=362613
2020-02-28 | Convert a few trivial consumers to the new unlocked grab API. | Jeff Roberson
Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23847 Notes: svn path=/head/; revision=358447
2020-01-19 | Don't hold the object lock while calling getpages. | Jeff Roberson
The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033 Notes: svn path=/head/; revision=356902
2020-01-03 | vfs: drop the mostly unused flags argument from VOP_UNLOCK | Mateusz Guzik
Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427 Notes: svn path=/head/; revision=356337
2020-01-03 | Fix a page leak in the md(4) swap I/O path. | Mark Johnston
r356157 removed a vm_page_activate() call, but this is required to ensure that pages end up in the page queues in the first place. Restore the pre-r356157 logic. Now, without the page lock, the vm_page_active() check is racy, but this race is harmless. Reviewed by: alc, kib Reported and tested by: pho Differential Revision: https://reviews.freebsd.org/D23024 Notes: svn path=/head/; revision=356326
2020-01-03 | Avoid duplicate I/O statistics accounting. | Alexander Motin
As in geom_disk, free the provider statistics structure and point GEOM toward local statistics. This saves some CPU time. MFC after: 2 weeks Notes: svn path=/head/; revision=356315
2019-12-30 | Use atomic for start_count in devstat_start_transaction(). | Alexander Motin
Combined with the earlier nstart/nend removal, this allows removing several locks from the request path of GEOM and a few other places. It would be cool if we had more SMP-friendly statistics, but this helps too. Sponsored by: iXsystems, Inc. Notes: svn path=/head/; revision=356200
2019-12-28 | Remove page locking for queue operations. | Mark Johnston
With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885 Notes: svn path=/head/; revision=356157
2019-12-15 | Add a deferred free mechanism for freeing swap space that does not require an exclusive object lock. | Jeff Roberson
Previously swap space was freed on a best effort basis when a page that had valid swap was dirtied, thus invalidating the swap copy. This may be done inconsistently and requires the object lock which is not always convenient. Instead, track when swap space is present. The first dirty is responsible for deleting space or setting PGA_SWAP_FREE which will trigger background scans to free the swap space. Simplify the locking in vm_fault_dirty() now that we can reliably identify the first dirty. Discussed with: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22654 Notes: svn path=/head/; revision=355765
2019-12-08 | vfs: introduce v_irflag and make v_type smaller | Mateusz Guzik
The current vnode layout is not smp-friendly by having frequently read data avoidably sharing cachelines with very frequently modified fields. In particular v_iflag inspected for VI_DOOMED can be found in the same line with v_usecount. Instead make it available in the same cacheline as the v_op, v_data and v_type which all get read all the time. v_type is avoidably 4 bytes while the necessary data will easily fit in 1. Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new flag field with a new value: VIRF_DOOMED. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D22715 Notes: svn path=/head/; revision=355537
2019-12-02 | Fix a few places that free a page from an object without busy held. | Jeff Roberson
This is tightening constraints on busy as a precursor to lockless page lookup and should largely be a NOP for these cases. Reviewed by: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22611 Notes: svn path=/head/; revision=355314
2019-10-15 | (4/6) Protect page valid with the busy lock. | Jeff Roberson
Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594 Notes: svn path=/head/; revision=353539
2019-09-09 | Change synchronization rules for vm_page reference counting. | Mark Johnston
There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficient as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller.
The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accommodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486 Notes: svn path=/head/; revision=352110
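The idea of a reference counter that also carries flag bits can be sketched with C11 atomics. Everything below is hypothetical (the bit values, `struct sketch_page` and the function names) and is far simpler than the kernel's vm_page code; it only shows how a "blocked" bit makes new mapped references fail without taking a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_OBJREF	0x80000000u	/* hypothetical: owning object's reference */
#define SKETCH_BLOCKED	0x40000000u	/* hypothetical: new mapped wirings refused */

struct sketch_page {
	_Atomic uint32_t ref_count;
};

/* Take a reference unless the blocked bit is set; no lock is needed. */
static bool
sketch_wire_mapped(struct sketch_page *m)
{
	uint32_t old = atomic_load(&m->ref_count);

	do {
		if ((old & SKETCH_BLOCKED) != 0)
			return (false);
	} while (!atomic_compare_exchange_weak(&m->ref_count, &old, old + 1));
	return (true);
}

/* Set the blocked bit so later sketch_wire_mapped() calls fail. */
static void
sketch_block(struct sketch_page *m)
{
	atomic_fetch_or(&m->ref_count, SKETCH_BLOCKED);
}

/* Drop a reference; returns true if nothing at all references the page. */
static bool
sketch_unwire(struct sketch_page *m)
{
	uint32_t old = atomic_fetch_sub(&m->ref_count, 1);

	assert((old & ~(SKETCH_OBJREF | SKETCH_BLOCKED)) > 0);
	return (old == 1);
}
```

A caller whose sketch_wire_mapped() fails would fall back to a slower path, in the same spirit as the fallback to the fault handler mentioned above.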
2019-08-16 | md(4): remove the unused and unusable MDIOCLIST ioctl. | Brooks Davis
It is unused, the ABI was broken in r322969, and it is broken by design (more than MDNPAD md devices can exist and there is no way to retrieve them with this interface). mdconfig(8) was converted to use libgeom to obtain this information in r157160 and any other consumers of MDIOCLIST should likewise be converted. Reviewed by: emaste Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18936 Notes: svn path=/head/; revision=351132
2019-03-31 | When using the force option to shut down a memory-disk device, I/O operations already in its queue were not being properly drained. | Kirk McKusick
The GEOM framework does the queue draining, but the device driver needs to wait for the draining to happen. The waiting is done by adding a g_md_providergone() function to wait for the I/O operations to finish up. It is likely that every GEOM provider that implements orphaning attached GEOM consumers needs to use the "providergone" mechanism for this same reason, but some of them do not do so. Apparently Kenneth Merry (ken@) added the drain for just such races, but he missed adding it to some of the device drivers that needed it. Submitted by: Chuck Silvers Reviewed by: imp Tested by: Chuck Silvers MFC after: 1 week Sponsored by: Netflix Notes: svn path=/head/; revision=345758