Anatomy of a Kernel Hack
Reading Val's
Lesson
#5, "Your First printk", reminded me of something
that's been bugging me about the bogomips calculation. The kernel
does an experiment at boot time to determine how fast the processor
is, and it prints out the result, which I can see with dmesg:
% dmesg | grep Bogo
Calibrating delay loop... 1395.91 BogoMIPS
Apparently it only does this once, at boot time. But one of my
machines is a laptop with speedstep, which means that when it's
running off batteries it runs slower to save power. So if I boot it
plugged in, then unplug and run off batteries, the kernel is now wrong
in its idea of how fast the processor is. And there's currently no
way, as far as I know, to set it right.
Val was hanging out on IRC one afternoon, and I mentioned being
curious how hard it would be to write a driver that recalculated the
value and printed out the current value. She encouraged me to find
out, so I did, and it works! (Well, mostly, but I'll get to that.)
Here's what I did. I'll go through how I figured it out, step by
step, in case other kernel hacking newbies like me are curious. Val
was on IRC giving me suggestions and moral support throughout.
First, I wondered whether I should make it a device in /dev or
something in /proc. Val said /proc would probably be much easier,
and I didn't want to deal with making major/minor IDs, so /proc it
was. I decided to call my device /proc/bogomips.
Now, I had to figure out how /proc devices are made. So I did
% ls /proc
and I looked for something with a fairly unique name that I could
grep for in the kernel source. execdomains looked pretty unique
-- I have no idea what it is, but I did a search on that string.
Aside: I have an alias for doing a recursive grep in C source (that's
something I need to do all the time at work). It looks like this:
alias cgr "find . -name '*.h' -or -name '*.c' \
-or -name '*.cpp' -or -name '*.cc' | xargs grep \!* /dev/null"
So I went to my kernel source directory and did: cgr execdomains.
That pointed me to fs/proc/proc_misc.c. So (as root) I edited that
file and searched for "execdomains". I found a routine,
execdomains_read_proc, and a line down near the bottom that was
pretty clearly the place where the actual /proc node was getting
registered.
So I added my new line there, right underneath the execdomains line:
{"execdomains", execdomains_read_proc},
{"bogomips", bogomips_read_proc},
Then I went back to execdomains_read_proc, made a copy of it and
renamed it bogomips_read_proc.
Now I had to figure out what to put in bogomips_read_proc. It was
back to Val's lesson 5: she said to put the printk in init/main.c,
and that's where the bogomips calculation was, in a function called
calibrate_delay(void). It looked like all I had to do was call
calibrate_delay(), then check the variable loops_per_jiffy and
print it out, and I'd have my driver.
But how was I supposed to return it? execdomains_read_proc() had
an int len which it set to something, then it returned:
int len = get_exec_domain_list(page);
return proc_calc_metrics(page, start, off, count, eof, len);
Okay, so I didn't know what that was doing, but all the stuff in
proc_misc.c seemed to return proc_calc_metrics
like that, so I figured I'd better do that too. So I did this:
int len;
calibrate_delay();
len = loops_per_jiffy;
return proc_calc_metrics(page, start, off, count, eof, len);
Then I tried compiling: make bzImage from the top. (First I tried
doing a make in fs/proc, but that didn't work so I guess I have to
run it from the top every time.)
The compile failed -- it didn't know about loops_per_jiffy. So it's
back to recursive grep to see if there was any .h file which mentioned
that variable. (I have another alias, hgr, that does a recursive grep
but only looks in .h files.) Sure enough, linux/delay.h declares it,
so I added
#include <linux/delay.h>
near the top of proc_misc.c, and ran make bzImage again.
This time it compiled! Cool! So now it's time to try it:
cp arch/i386/bzImage /boot/vmlinuz-test
(edit /etc/lilo.conf to add an entry for vmlinuz-test)
lilo
(gulp, deep breath) reboot!
Cool! The new kernel booted and I was able to log in as myself.
I cd'ed to /proc, did an ls -- wow, bogomips is actually there!
So (deep breath) cat bogomips.
Yowch! screensful of garbage, beeping, and at the end my prompt
was in what looked like vt100 graphics characters. :-) Okay, that
wasn't quite it. But it didn't panic the system, and reset got
me my normal prompt back.
Back into X, su, xemacs, and look at my routine, and meanwhile
talk to Val on IRC. She suggested looking at what cpuinfo did, since
/proc/cpuinfo prints lots of information including bogomips. So I
looked at that for a while, and found out that /proc/cpuinfo is a
lot more complicated than I thought. :-) It's complicated because
it implements four operations -- open, read, llseek, and release --
rather than just read, and it does everything via routines named
seq_(variousthings). seq_read ultimately called
seq_printf off in
some other file (more recursive grepping here, I'll spare you),
and Val said seq_printf keeps track of the beginning and end of
the string that you pass in, and I decided that this was more
complicated than what I was after and I needed to find something
simple that I did understand.
So ls /proc and stare at the list, and I found something that looked
like it should be simple: /proc/loadavg prints a single line and I
understand what a loadaverage is, so maybe I can use that as a model.
Back to proc_misc.c and find loadavg_read_proc -- aha, it does
len = sprintf(various stuff);
return proc_calc_metrics(page, start, off, count, eof, len);
I figure I can do that.
So using init/main.c as a model for how to print out bogomips so they
look good, I do this:
static int bogomips_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
int len;
printk("Recalibrating delay loop... Initially %d\n", loops_per_jiffy);
calibrate_delay();
printk("Afterward: %d\n", loops_per_jiffy);
len = sprintf(page, "bogomips: %lu.%02lu\n",
loops_per_jiffy/(500000/HZ),
(loops_per_jiffy/(5000/HZ)) % 100);
return proc_calc_metrics(page, start, off, count, eof, len);
}
(I'm hoping HZ is just magically defined in some include file I'm
already including, and I got lucky there.) make bzImage tells me that
loops_per_jiffy is a long and %d is an int format. I hate code that
throws warnings so I change %d to %ld in both places, and rebuild. It
builds! I copy to /boot, lilo and reboot.
It boots! I log in as myself and (several deep breaths)
% cat /proc/bogomips
It printed out:
Recalibrating delay loop! Initially 6979584
Calibrating delay loop... 1395.91 BogoMIPS
Afterward: 6979584
bogomips: 1395.91
Recalibrating delay loop! Initially 6979584
Calibrating delay loop... 1395.91 BogoMIPS
Afterward: 6979584
Woohoo! That "Calibrating delay loop..." isn't from me, it's from
calibrate_delay() ... so my driver actually worked! For some reason
it called calibrate_delay() a second time, which isn't so good, but
basically it worked! I started X and ran it again, and this time I
just saw:
% cat /proc/bogomips
bogomips: 1395.91
(since all the other stuff was going to the console, not to my terminal).
Now for the real test: this was all done with the laptop plugged in.
So I unplugged it (now it's running on batteries and should be slower):
% cat /proc/bogomips
bogomips: 1097.72
It got it right! This is so cool! Plugged back in and it went back
up to 1395.91! I went back in and cleaned up -- removed the printk
calls, and also the "bogomips: " since Val said it was normal to just
print the value.
I still didn't know why it called calibrate_delay() again after it
prints the value, so I needed to figure that out. But I was pretty
jazzed -- I actually wrote a real (if very simple), useful driver that
does something I wanted anyway!
(The next day ...)
I've been thinking about why calibrate_delay_loop gets called twice
(that's bugging me). I put in some printk calls and discovered that
my read proc is being called twice; the arguments are all the same
except for offset, which is 0 the first time and 8 the second time.
The length it's returning (the number of characters from the sprintf)
is 8, so I'm guessing that cat is trying to read beyond that because
it didn't see an eof (i.e., read() didn't return 0).
I discovered that the O'Reilly device drivers book talks a little
about /proc, and that I'm supposed to set *eof=1 when I have nothing
more to return. But proc_calc_metrics is already doing that (and I
tried bypassing proc_calc_metrics and setting *eof=1 myself, which
didn't help), but the read proc is still called a second time with
offset=8.
It looks like the main difference is that the first time, it sets *start
and returns 8, whereas the second time, it returns 0 and doesn't set *start.
Of course, I could hack this by saying if (offset>0) return, or
perhaps by returning 0 instead of the length, but those doesn't seem
right. The O'Reilly book is pretty clear that I'm supposed to return
the number of characters, not zero. I want to understand why setting
*eof isn't doing it. But it looks like there are other /proc drivers
which have the same issue, so I've added a check for (offset > 0) into
the code, so it no longer calls calibrate_delay() twice.
Editor's note: Several other people on the linux-kernel mailing
list have noticed and discussed the EOF problem. At this time, no one
has implemented a general solution.
|