So you got the idea, or better, somebody made you, to migrate from a cosy EMC NFS server to a Windows Failover Cluster with the File Server role serving NFS to Linux clients.
All fine and great… Until it's not.
When you see this for the first time, I bet you'll feel lost too:
www-data@docker9:/mnt/nfs$ chown www-data testfile
chown: changing ownership of 'testfile': Permission denied
Let's look at the issues we had doing this.
Until recently we ran an EMC VNX 5200 with redundant Data Movers that served NFS and (although unused) CIFS shares. The performance was decent, and we had just one outage, but even that was mitigated right away as it was just one DM rebooting itself.
The old EMC got decommissioned. We got a brand new one, but it had a different purpose. We moved all the virtual servers that were previously on the EMC (Fibre Channel connected) to a Microsoft Failover Cluster with Storage Spaces Direct. It's a great thing, with killer IOPS performance, so an “executive” decision was made to also move that NFS share we needed to the same platform. After all, IOPS!
I won't go into installation, but we created two Windows Server 2016 VMs, one Shared Drive - VHD Set and configured a File Server role.
The first problem was the NFS access permissions.
In a classic AUTH_SYS environment you can't set permissions on IP wildcards or CIDR ranges! With tens of hosts and multiple subnets, managing this is shooting yourself in the foot. But for the performance, we were ready to take this one.
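Since every host has to be listed individually, one way to keep this manageable is to generate the host list from the subnets you would have liked to write as CIDR. The following is only a sketch in POSIX shell: the function names are made up for illustration, it handles IPv4 only, and what you then feed the addresses into on the Windows side is up to you.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell, so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Convert a 32-bit integer back to dotted-quad notation.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Print every usable host address in an IPv4 CIDR block, one per line
# (the network and broadcast addresses are skipped).
expand_cidr() (
    base=${1%/*}
    prefix=${1#*/}
    ip=$(ip_to_int "$base")
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    network=$(( ip & mask ))
    broadcast=$(( network | (~mask & 0xFFFFFFFF) ))
    i=$(( network + 1 ))
    while [ "$i" -lt "$broadcast" ]; do
        int_to_ip "$i"
        i=$(( i + 1 ))
    done
)
```

Calling `expand_cidr 192.168.1.0/30` prints `192.168.1.1` and `192.168.1.2`, and the output can be piped into whatever script maintains the per-host permissions.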
Then, case sensitivity.
Since the share is backed by NTFS, file names were case insensitive, and we could get collisions on paths like /test/Test. The powers that be insisted that it was bad system design for this to happen in the first place, but luckily the collisions we had were few and irrelevant.
I want to note that there is a solution for this with
fsutil.exe file setCaseSensitiveInfo C:\folder enable but our installation was too old (Windows Server 2016) to install the required WSL feature.
Note: Apparently this could also be used:
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel" /v obcaseinsensitive /t REG_DWORD /d 0.
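If you want to know up front how many collisions to expect, you can check the old share before migrating. This is a minimal sketch, assuming paths without newlines in them. The function name is made up for illustration.

```shell
# Read paths on stdin, one per line, and print each lower-cased path
# that occurs more than once, i.e. paths that differ only by case.
case_collisions() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Usage on a mounted share (path is an example):
#   find /mnt/nfs -print | case_collisions
```

An empty result means that no two paths would collapse into one on a case-insensitive file system.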
Next, transfer performance.
Initially our test hosts mounted the shares as NFSv3, because NFSv4 just didn't work. File transfer performance was an order of magnitude (20x or worse) slower. This was mitigated by actually reading the fine print and realizing that Windows NFS does not support protocol version 4.0, but does support 4.1.
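On the Linux side this means asking for 4.1 explicitly and then checking what was actually negotiated. A minimal sketch follows. The server name `fs01.example.com` and the helper function name are assumptions for illustration, and the helper assumes the kernel reports the version as a `vers=` mount option, which it does on current kernels.

```shell
# Mount explicitly as NFSv4.1 (needs root; server and export are examples):
#   mount -t nfs -o vers=4.1 fs01.example.com:/share /mnt/nfs

# Print "<mountpoint> <version>" for every NFS mount in a mounts table.
# $1: file to read; defaults to this process's view of the mount table.
nfs_mount_versions() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        n = split($4, opts, ",")
        vers = "unknown"
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^vers=/) vers = substr(opts[i], 6)
        print $2, vers
    }' "${1:-/proc/self/mounts}"
}
```

If a mount point shows up with `3` instead of `4.1`, the client has fallen back to NFSv3 or was mounted with older options.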
Permission denied on chown
So, with decent performance we bumped into one last issue: the Permission denied error when a user chowns a file they already own to their own user. Well, it wasn't that obvious at first. The problem showed itself when moving a file, in a web server, in Docker… We finally reproduced the issue in a shell and then, with Wireshark, figured out that it has to do with mv issuing a chown in the background.
bash-5.0$ mv /tmp/testfile /mnt/nfs/tmp/
mv: can't preserve ownership of '/mnt/nfs/tmp/testfile': Permission denied
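The reason is that a move across file systems is really a copy followed by a delete, and mv tries to carry the original owner over to the copy. While the server side is not fixed, a client-side stopgap is to do the copy and the delete yourself without preserving ownership. This is only a sketch, not what we ended up doing, and the function name is made up.

```shell
# Move a file across file systems without the ownership-preserving chown.
# cp without -p creates the destination as the calling user and never
# calls chown; the source is removed only if the copy succeeded.
mv_without_chown() {
    cp -- "$1" "$2" && rm -- "$1"
}

# Usage (paths are examples):
#   mv_without_chown /tmp/testfile /mnt/nfs/tmp/
```

This only helps for scripts you control. Applications that call mv or rename files across mounts on their own will still hit the error.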
The flag that controls this, RestrictChown, should be in the registry, located at
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ServerForNfs\CurrentVersion\Exports\<No>/Restrictchown. However, in our clustered environment, there was no such key under
I had not found a solution anywhere, and then discovered on my own that in a clustered environment a certain part of the “shared” Registry is located under
HKEY_LOCAL_MACHINE\Cluster\.... Each clustered resource has its own key. To figure out what the ID is, you can either look at them and guess, or use
Get-ClusterResource -Name <Name> | fl -Property *. Under that key is another key
Parameters and the proper way to manipulate values here is with
Get-ClusterResource -Name NFS-clnfs | Get-ClusterParameter. However, what we need is even deeper down:
I found no way to manipulate a sub path like
img\RestrictChown with command line tools, so the only way left was to change the value on both servers, and reboot them.
I wish I had found a better way…