Tools for Debugging Samba
Recently I spent several unpleasant days trying to troubleshoot an authentication problem with some Samba shares. I will describe the problem later, but more important than the particular issue is the set of tools I picked up to troubleshoot Samba issues. Apparently this suite of tools is obvious to everybody else on the Internet but me, so I am writing them down for my future self.
The bad news is that Samba is complicated, and tracing what the
software suite is doing is tricky. The good news is that Samba has
lots of debugging infrastructure and is open
source. I have no intention of compiling Samba from source ever, but
the debugging logs print function names, so following along the
program flow by looking through source files was possible (and not
even that unpleasant, since the source code is fairly readable). I was
debugging on Ubuntu, so I grabbed the .orig.tar.xz
file from
http://packages.ubuntu.com/trusty/samba and followed along.
As indicated in the above link, I was using Samba 4.1 when doing this troubleshooting.
Log files and smbcontrol
Samba generates logs of log files. Thanks to the following line in
smb.conf
:
log file = /var/log/samba/log.%m
there is a separate log file generated by each host that connects to
the share. There are also log.wb-<DOMAIN>
and log.winbindd
files that are useful. If you are able to troubleshoot when the shares
are being used less, then going to the /var/log/samba
directory and
typing ls -ltc
will show you the latest log files that were
accessed. These files might have clues about the problematic
behaviours.
You can also increase the verbosity of log files on the fly, which is really handy. Log verbosity ranges from 0 to 10 (I guess the Samba developers never watched This is Spinal Tap.) Verbosity 10 prints out low-level data structures that are being sent and received, so it was helpful for me.
To increase the verbosity of winbind
logging to 10 on the fly, type:
smbcontrol winbind debug 10
You can change the verbosity of other components as well.
You want to decrease this back to 0 or 1 after troubleshooting, because there is a lot of logging that probably slows Samba down.
There is also a pdbedit
command which apparently is useful for
manipulating SAM password files, but I did not need it for this task.
Name Lookups and Winbind tools
My problem had to do with winbind
finding incorrect names from our
Active Directory LDAP database. The easiest way to look up a name was
to type:
getent passwd
or
getent passwd <username>
The wbinfo
command also is useful. wbinfo -u
seems to do much the
same as getent passwd
. wbinfo --lookup-sids
was useful to
translate Windows security identifiers into user names.
Restarting Samba and Clearing Cache
In several situations I found myself wanting to start and stop Samba. On Ubuntu, this meant typing:
service smbd stop
service nmbd stop
service winbind stop
and the corresponding "start" commands to restart the services.
To flush the caches, you are supposed to be able to type:
net cache flush
but this does not actually flush every cache.
You are supposed to be able to run winbind
with a -n
switch to
bypass caching, but this does not ignore all caches.
TDB caches and tdbtool
Samba caches information in a number of "trivial database" (.tdb)
files. On my Ubuntu system, these files were located in
/var/lib/samba
and /var/cache/samba
. Supposedly all of these
files (or maybe the ones in these folders?) are regenerated if they
are deleted. The Samba documentation
page
suggests that some of these files should not be deleted.
To dump the contents of a database use the tdbtool
command. For
example:
tdbtool <filename> dump
If you redirect this to a file you get something you can grep.
Domain Member Operations
It is easy to discover how to join a domain from Samba:
net ads join -U <admin-user-name>
It is less easy to discover how to unjoin a domain:
net ads leave -U <admin-user-name>
Windows LDAP Logging
This is not really Samba related but is really useful. You can make your domain controllers report the exact LDAP queries from clients. The trick is documented here: http://serverfault.com/questions/382665/are-there-alternatives-to-sysinternals-adinsight . Set the following registry values on your domain controller:
Path: HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics\15 Field Engineering
Type: DWORD
Value: 5
Path: HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters\Expensive Search Results Threshold
Type: DWORD
Value: 1
I am not sure why the second registry entry should be required. The documentation suggests it is for filtering certain types of LDAP queries. But I was not able to get things working without it. The following registry entry may also be relevant:
Path: HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics\16 LDAP Interface Events
Type: DWORD
Value: 2
but I do not know for sure. (It appeared to be set on my domain controller and I don't know why.)
In addition to these registry edits I also had to increase the size of the "Directory Service" event log. Go to Event Viewer, right-click the event log in question, choose "Properties" and then increase the "Maximum Log Size".
There is official Microsoft documentation here: https://support.microsoft.com/en-us/kb/314980
Process Tracking
At one point I needed to figure out what domain controller winbind
was connecting to:
lsof -i -n | grep winbind
strace
is really verbose but sometimes useful in figuring out what
filenames are being touched. There are a bunch of invocations you
could use. I first restarted winbind to get a process number
# service winbind restart
winbind stop/waiting
winbind start/running, process 5010
and then I called strace
to attach to this process:
strace -o /tmp/winbind.strace.01 -f -ttT -p 5010
-p
indicates the process to attach to. -f
attaches to
the process AND its children. -ttT
does some time output I don't
understand but is probably useful.
tcpdump
can be helpful in two ways: you can see whether your process
is talking to the hosts you think it is and you can generate PCAP
files that you can then use in Wireshark. To look at communications
between my machine 10.10.10.25
and the two domain controllers
10.10.10.2
and 10.10.10.3
I used the following invocation:
tcpdump -i eth0 -n 'host 10.10.10.25 and (host 10.10.10.2 or host 10.10.10.3)'
Note the single quotation marks. They are important because of shell parsing.
If I want to generate a PCAP file I use the -w
switch. In this case
we can also look for port traffic from my machine on port 389:
tcpdump -i eth0 -n -w /tmp/winbind.tcpdump.01.pcap 'host 10.10.10.25 and port 389'
Sidebar: The troubleshooting task
We have a Samba server which has some WPKG scripts. On shutdown, our staff machines are supposed to run these scripts. When logged in, we could run the scripts manually, but the shutdown scripts were not working.
In looking at the Samba smbd
logs for the machine, it looked like
the computer was trying to authenticate as the wrong machine. For
example, the log for machine new-machine04
would have the following
entries:
[2016/03/01 02:53:04.537446, 0] ../source3/auth/pampass.c:797(smb_pam_accountcheck)
smb_pam_accountcheck: PAM: Account Validation Failed - Rejecting User old-machine01$!
[2016/03/01 02:53:04.537730, 1] ../source3/auth/user_krb5.c:142(get_user_from_kerberos_info)
PAM account restrictions prevent user [TWC\old-machine01$] login
[2016/03/01 02:53:04.537759, 1] ../source3/auth/auth_generic.c:97(auth3_generate_session_info_pac)
Failed to map kerberos principal to system user (NT_STATUS_WRONG_PASSWORD)
[2016/03/01 02:53:04.637576, 0] ../source3/auth/pampass.c:797(smb_pam_accountcheck)
smb_pam_accountcheck: PAM: Account Validation Failed - Rejecting User old-machine01$!
[2016/03/01 02:53:04.637860, 1] ../source3/auth/user_krb5.c:142(get_user_from_kerberos_info)
PAM account restrictions prevent user [TWC\old-machine01$] login
[2016/03/01 02:53:04.637889, 1] ../source3/auth/auth_generic.c:97(auth3_generate_session_info_pac)
Failed to map kerberos principal to system user (NT_STATUS_WRONG_PASSWORD)
There was something going wrong with machine accounts. Removing the
computer from the domain, deleting the computer object, and then
re-joining the computer fixed the problem, but I did not know where
the old-machine01
entries were coming from. They were legitimate
machine names we had used in the past, but they were mapped
incorrectly. I was freaked out because I thought that our Active
Directory might be inconsistent, which would have been a huge problem.
Fortunately, the problem ended up being Samba's fault. The
smb_pam_accountcheck
routine uses information returned from
winbind
, and winbind
was referencing a cache file
/var/cache/samba/netsamlogon_cache.tdb
that is (a) stored in a
different place than other cache files and (b) is never flushed or
ignored, even if you specify the -n
parameter to the winbind daemon.
This file contained the mysterious old machine names.
I believe the problem is as follows: sometimes instead of reimaging a
computer, we would rename an existing computer and give it to somebody
new. This meant that the computer SID would not change. Active
Directory stored the new information (and a new machine account
password) and the netsamlogon_cache.tdb
file kept the old machine
account name with the same SID. This lookup would always fail, so
Samba would reject the connection request, so the shutdown script
would not work. Deleting or renaming the netsamlogon_cache.tdb
file
fixes the problem. Unfortunately I am so inefficient at
troubleshooting Samba that figuring this out took me days.