How to fix Linux boot problems

Booting, or "bootstrapping" for us older folk, is that deeply mysterious sequence of operations performed by your computer between the moment when you switch it on and the moment it's ready for you to log in. During this time, all kinds of incomprehensible messages scroll up the screen, but they're not something you usually take much notice of, and most Linux distros cover them up with a pretty splash screen and a nice encouraging progress bar. This is all fine, of course, until it stops working. In this tutorial we'll examine the boot process in more detail, looking in particular at what can go wrong, and how to diagnose and fix the problem.

Grokking the problem

When I'm teaching Linux on one of my courses, many attendees tell me they are interested in troubleshooting of one form or another. Some of them are looking for a cookbook approach - "If you see the error message X, run command Y", but troubleshooting rarely works that way. My initial advice to anyone who needs to troubleshoot is always the same: "The most important thing in troubleshooting is to understand how the system is supposed to work in the first place. The second most important thing is figuring out exactly what the system was trying to do when it went wrong."

Figure 1: the normal sequence of events when booting Linux.

With this in mind, let's take a look at how Linux boots. Knowing the normal sequence of events, and determining how far it got before it ran into trouble, are key to diagnosing and fixing boot-time problems. Figure 1 (above) shows the normal sequence of events (green arrows) and indicates some of the possible failure paths (red arrows).

Picking yourself up by your bootstraps

Booting is a multi-stage affair. When a PC is powered up, control initially passes to a program (called the BIOS) stored in read-only memory on the motherboard. The BIOS performs a self-test of the hardware and scouts around looking for a device to boot from. The BIOS provides configuration screens that allow you to assign the order in which it searches for a bootable device, and modern BIOSes support a wide range of boot devices, including PXE booting from a network server. The only case we consider here is booting from the hard drive.

The BIOS loads the Master Boot Record (MBR) of the selected boot device and executes it. (If this fails, the BIOS will report something like "Missing Operating System", and come to a screaming halt.) The MBR occupies the very first sector of the drive. It holds the drive's partition table (64 bytes) and a very short program (446 bytes) which is 'stage 1' of the bootstrap loader.
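
If you'd like to see that layout for yourself, here is a minimal sketch. It relies on one fact not mentioned above: the 446 bytes of code and 64 bytes of partition table are followed by a two-byte signature (55 aa), which makes up the full 512-byte sector. The function names mbr_signature and has_boot_signature are my own inventions for illustration, not standard tools.

```sh
#!/bin/sh
# Print the two-byte boot signature at offset 510 of a disk or disk image.
# A valid MBR ends in the bytes 55 aa (446 code + 64 table + 2 signature = 512).
# mbr_signature is a hypothetical helper name, not a standard tool.
mbr_signature() {
    # $1: device or image file, e.g. /dev/sda (reading a device needs root)
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Succeed if the first sector carries a valid boot signature.
has_boot_signature() {
    [ "$(mbr_signature "$1")" = "55aa" ]
}
```

Run as root against the boot drive, has_boot_signature /dev/sda should succeed on a disk with an intact MBR. A missing signature is one plausible reason for the BIOS to refuse to boot the drive, though it is by no means the only one.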

This stage 1 loader is pretty dumb - all it does is display the word GRUB on the screen then load a second stage boot loader using a 'block map' that is embedded into the MBR. (The block map specifies the disk block numbers where the second stage loader resides). I'm assuming here that we're using the Grub boot loader. There's an earlier boot loader called Lilo, but Grub is more recent, smarter, and used in most modern Linux distros. The second stage of the Grub boot loader is actually called stage one-and-a-half, and if you list the directory /boot/grub you can see the files that contain the various versions of this; they have names like e2fs_stage1_5 and reiserfs_stage1_5.

Each of these programs is able to access files by name using one particular filesystem format. e2fs_stage1_5 can read ext2 and ext3 filesystems, reiserfs_stage1_5 can read ReiserFS filesystems, and so on. Grub's ability to access files by name at boot time (before Linux is running) is the thing that really sets it apart from Lilo. The stage one-and-a-half program loads Grub stage 2, which is considerably larger. This stage reads the Grub configuration file (usually /boot/grub/menu.lst or /boot/grub/grub.conf) and, based on the entries it finds there, it presents a menu of choices of the operating system you want to boot. If Grub can't find its config file it will drop down to an interactive command-line prompt to allow you to enter Grub commands manually.

A typical entry in menu.lst looks like this:

title openSUSE 10.2
    root (hd0,0)
    kernel /boot/vmlinuz-2.6.18.2-34-default root=/dev/hda1 vga=0x317 showopts
    initrd /boot/initrd-2.6.18.2-34-default

The title line simply specifies the text that will appear in Grub's boot-time menu. The lines that follow specify the commands that Grub will execute if you select that item from the menu. The root line sets Grub's idea of where the root filesystem resides. Grub has its own way of naming disk partitions which is confusingly different from the naming scheme used by Linux.

In Grub-speak, hd0 refers to the first drive - on a typical PC with IDE drives this corresponds to the Linux device name /dev/hda, or, in some of the more recent distros, /dev/sda. In Grub-speak, (hd0,0) refers to the first partition on that drive. Linux would call this /dev/hda1 or /dev/sda1. The kernel line specifies the file that Grub will load as the Linux kernel; at the end of this line you may see some additional boot parameters that are passed to the kernel. More about these later.
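
The translation between the two naming schemes is mechanical enough to write down. The following is only a sketch: grub_to_linux is a hypothetical helper of my own, and it assumes that Grub's drive order matches the kernel's, which is not guaranteed (Grub's device.map file can say otherwise).

```sh
#!/bin/sh
# Translate a Grub legacy name such as (hd0,0) into a Linux device name.
# grub_to_linux is a hypothetical helper. The prefix (sd or hd) is an argument
# because distros differ on which one they use.
# Assumption: Grub's drive order matches the kernel's.
grub_to_linux() {
    name=$1
    prefix=${2:-sd}
    # Strip the parentheses and the "hd" tag, leaving "drive" or "drive,partition".
    body=${name#\(hd}
    body=${body%\)}
    case $body in
        *,*) drive=${body%%,*}; part=${body#*,} ;;
        *)   drive=$body; part= ;;
    esac
    # Grub counts drives from 0; Linux letters them a, b, c, ...
    letter=$(printf 'abcdefghijklmnopqrstuvwxyz' | cut -c "$((drive + 1))")
    if [ -n "$part" ]; then
        # Grub counts partitions from 0; Linux counts them from 1.
        printf '/dev/%s%s%s\n' "$prefix" "$letter" "$((part + 1))"
    else
        printf '/dev/%s%s\n' "$prefix" "$letter"
    fi
}
```

For example, grub_to_linux '(hd0,0)' hd prints /dev/hda1, matching the root= parameter in the menu.lst entry shown earlier, while grub_to_linux '(hd0,0)' with no second argument prints /dev/sda1.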

The initrd line specifies the file that contains the 'initial RAM Disk' - a file system image that will be used by the kernel as it boots. Grub is also responsible for loading this file into memory. If Grub fails to find the kernel or the ramdisk images it will report Error 15: File not found, and halt.

Once the kernel starts running, it mounts the root file system from the hard drive. The name of the partition that holds this file system is passed as a parameter to the kernel, as you can see from the menu.lst entry above. Mounting the root file system is a key point in the boot process and if you're trying to pin down a boot-time problem it's vital to figure out if the kernel was able to get this far. Failure to mount the root file system will generally result in a kernel panic, though some systems just appear to halt.

If the kernel succeeds in mounting the root file system, it creates a single process (with process ID 1) which runs the program /sbin/init. If the kernel can't find init, it will either panic and halt or (depending on the distro) drop you into a root shell. Oh, by the way, just to add a little confusion, Ubuntu doesn't use init any more; it uses a replacement called Upstart.

The program init is responsible for running the scripts that will start all the other services in the system. There is one important and rather low-level script run by init early in the process. On Red Hat-style systems it's /etc/rc.d/rc.sysinit, on SUSE it's /etc/init.d/boot. Among other things, these early scripts consistency-check and mount any other disk partitions, as specified in /etc/fstab. Although there is certainly plenty of scope for things going wrong at this stage, we need to leave the story at that point, at least for this month.

Getting to grips with Grub

A key skill in fixing boot-time problems is knowing how to manually intervene in the Grub boot sequence. Most distros configure Grub to boot a default choice from the menu, but allow a short time window (a few seconds) in which you can press Esc to interrupt this and gain direct control of Grub. Typically this will exit from the Grub splash screen and drop to a character-based menu.

From here, follow the on-screen instructions to select an item from the menu, and edit the commands associated with that menu selection before booting. It's even possible to drop down to a Grub command prompt where you can enter Grub commands directly; for example at this point you could, in theory, manually enter the root, kernel and initrd lines from the menu.lst file we looked at earlier. Figure 2 (below) shows the result of typing help at the Grub command prompt.

Figure 2: the result that is output when 'help' is typed at the Grub command prompt.

Rescue Booting

If no amount of tweaking with the Grub boot commands will allow your system to boot, it may be time to perform a 'rescue boot', which means that you'll boot Linux from an installation CD or other 'rescue media'. The kernel and its modules are loaded off the CD, along with a small file system that is built in memory. This results in a small but working Linux system that isn't dependent on any file systems on the hard drive.

You can then mount the hard drive's partitions into the file system and access them to repair the damage. The installation DVD or CD of almost any modern Linux distribution can be used for this purpose - there is no requirement that the rescue media is from the same distribution as the one you're trying to rescue.

It's important to keep a clear head when using a rescue system because the files on your hard drive won't be in the same place when they are mounted into the rescue system as when they're viewed by the 'real' installation. For example, if in the rescue system I were to mount my system's root partition onto /mnt, then the file I would normally see as /etc/fstab will be seen as /mnt/etc/fstab.

Case Study 1

Our first case study concerns a RHEL5 system on which the /usr directory had been placed on a separate partition. For some reason (I forget what) I had to dictate an edit to /etc/fstab over the phone to a newbie administrator and he ended up with the field LABEL=/user instead of LABEL=/usr in the line that specified the mounting of the /usr partition. This error prevents the system from coming up multi-user; instead, it drops into a single-user shell. The dialogue looks like this:

fsck.ext3: Unable to resolve 'LABEL=/user'        [FAILED]

*** An error occurred during the file system check.
*** Dropping you to a shell; the system will reboot
*** when you leave the shell
Give root password for maintenance
(or type Control-D to continue):

The important thing is to actually read the error message. Why is it trying to find a label called /user? In which file would this appear?

Hmmm ... Maybe /etc/fstab? Since the system has graciously dropped us into a root shell, you would imagine it would be straightforward to edit /etc/fstab and fix the error. However, it turns out that the root file system is mounted read-only at this point (for the benefit of fsck) so you are not able to edit the file and save the result. The trick here is to change the mount options for the root partition to mount it read-write without actually unmounting it, like this:

# mount -o remount,rw /

After doing that, I'm able to edit /etc/fstab, correct the error, and reboot normally.
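
A mistake like this can also be caught before the next reboot. Here is a minimal sketch that pulls the LABEL= values out of an fstab-style file so that each one can be checked; fstab_labels is a hypothetical helper name of my own, and the loop in the comments assumes the findfs command is installed on your system.

```sh
#!/bin/sh
# List the LABEL= values used in an fstab-style file, one per line.
# fstab_labels is a hypothetical helper name.
fstab_labels() {
    # $1: path to the fstab file (defaults to /etc/fstab)
    # Skip comment lines; print the first field with the LABEL= prefix removed.
    awk '$1 !~ /^#/ && $1 ~ /^LABEL=/ { sub(/^LABEL=/, "", $1); print $1 }' \
        "${1:-/etc/fstab}"
}

# Possible use (assumes findfs is available; run as root):
#   for label in $(fstab_labels); do
#       findfs "LABEL=$label" >/dev/null 2>&1 || echo "no filesystem labelled $label"
#   done
```

On the system in this case study, a check along these lines would be expected to complain about /user, since no partition carries that label.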

Case Study 2

Our second case study is more complex than the one above, and involves a dual-boot scenario.

First I installed OpenSUSE 10.2 onto an empty hard drive, allocating an 8GB partition hda1 as the root partition and a 2GB partition hda2 for the swap partition. The installer wrote the stage 1 Grub loader into the master boot record, and the stage 1.5 loader into the disk blocks immediately following. Some time later, I installed Fedora 7 onto the spare space on the hard drive, allocating a further 8 GB partition hda3 for its root file system and sharing hda2 with the OpenSUSE installation as the swap partition.

Figure 3 (below) might help if you're getting lost already. I allowed Fedora to go ahead with its default selection to install Grub onto /dev/sda. (The naming is confusing because OpenSUSE 10.2 names the partitions in the traditional way, as hda1, hda2 and so on, whereas Fedora 7 uses SCSI-style naming for the self-same partitions - sda1, sda2 and so on.)

Figure 3: dual-boot configuration, showing the interaction between the SUSE and Fedora partitions.

During the Fedora installation I was offered the opportunity to add additional items into the Grub boot menu so I asked it to install an extra item with the label SUSE 10.2 and the device name /dev/sda1. This results in an entry in Fedora's Grub config file like this:

title SUSE 10.2
      rootnoverify (hd0,0)
      chainloader +1

If I select this item from the menu, Fedora's Grub will in turn load OpenSUSE's Grub from the first sector of the first partition (which Grub calls (hd0,0) and Fedora 7 calls /dev/sda1 and OpenSUSE calls /dev/hda1 - does this really have to be so confusing?) So now I can choose either Fedora or SUSE at boot time, although booting SUSE does involve the minor annoyance of having to go through a second Grub boot menu. If I had wanted to get clever, I could have copied the stanza for booting SUSE from the Grub config file in hda1 to the Grub config file in hda3. That would have been marginally more convenient, as I could then have booted direct into SUSE without having to go through a second menu.

Tipping Fedora

Anyway, a few weeks later, I decided that I no longer needed the Fedora installation and that instead I'd like to re-use the partition in the SUSE installation. So (while booted into SUSE) I ran:

# mke2fs -j /dev/hda3

...to create a new file system on the partition. Remember, hda3 was the root filesystem of my Fedora installation. Then I mounted my newly created file system onto /mnt. Everything was going fine until I tried to reboot. This is when the, er, products of digestion hit the fan, because at this point, Grub fails. I can't boot into Fedora (OK, I had deliberately destroyed that partition, so I expected that) but more seriously, I can no longer boot SUSE. In fact, all that happens is that the single word "GRUB" appears on the screen.

The secret at this point is: Don't Panic! Think back about what we did. Take another look at Figure 3, and remember that the contents of hda3 have now been overwritten. The stage 1.5 loader installed by the Fedora installation references a stage 2 loader in a file system on hda3 that no longer exists. What we need to do is re-install the original stage 1 and stage 1.5 Grub files in order to restore the capability to boot SUSE 10.2.

The only way forward is to boot from rescue media. As I mentioned earlier, almost any Linux installation CD will do the job here, but I chose to use Ubuntu 7.04 which is designed as a 'live CD'; that is, it will boot a full, working Linux system including a Gnome desktop, from CD.

Ubuntu, like Fedora, names the hard disk partitions as sda1, sda2, so once my live CD has booted I can mount my SUSE root partition into the Ubuntu file system like this:

$ sudo mount /dev/sda1 /mnt

(In case you were wondering about the sudo in this command, Ubuntu disables direct root logins and requires me to use sudo each time I run a command with root privilege. On the live CD distribution, no password is required to do this.) Now I need to run Grub to re-write the MBR. At this stage, I'm not entirely sure where the grub command is (I'm looking for the copy that SUSE installed; that is, the one on hda1), but by running:

$ sudo find /mnt -name grub

I quickly discover that it's in /mnt/usr/sbin/grub. Now I can run it and enter commands at the Grub command prompt like so:

$ sudo /mnt/usr/sbin/grub

grub> root (hd0,0)
grub> setup (hd0) (hd0,0)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd0)"... 15 sectors are embedded. succeeded
 Running "install /boot/grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit

The key command here is setup (hd0) (hd0,0), which tells Grub to re-install the original (SUSE) stage 1 into the MBR of (hd0) and to copy the original (SUSE) stage 1.5 into the sectors immediately following. (For a more detailed description of this, look up the install and embed commands in the GNU manual at www.gnu.org/software/grub/manual/grub.html.) Once this is done, I can simply reboot and... Lo! My SUSE Grub boot menu is back!
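
It isn't something I did in this case study, but if you're nervous about rewriting the boot sector, one precaution is to save a copy of it first. This is a minimal sketch using dd; backup_mbr is a hypothetical helper name of my own, not part of Grub.

```sh
#!/bin/sh
# Copy the first 512-byte sector (the MBR) of a disk to a backup file.
# backup_mbr is a hypothetical helper name, not part of Grub.
backup_mbr() {
    # $1: disk device or image, e.g. /dev/sda   $2: backup file to create
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}
```

From a live CD the equivalent one-liner would be something like sudo dd if=/dev/sda of=mbr.bak bs=512 count=1. Be aware that writing the whole 512 bytes back also restores the old partition table; copying back only the first 446 bytes restores the stage 1 code alone. Note too that this saves only the first sector, not the stage 1.5 loader in the sectors that follow it.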

International rescue

Live CDs such as Ubuntu or Knoppix do the job of rescue booting just fine, but there are smaller, faster rescue boot disks out there. One you might want to take a look at is SystemRescueCD (see www.sysresccd.org). It not only boots much faster but also contains an extensive collection of tools for managing and editing partitions, including classics like GParted and more unusual items such as NTFS-3G, which allows you to access Windows NTFS partitions with full read and write support - great for rescuing files off that old malfunctioning Windows XP partition! Another benefit of this rescue CD is that it can cache its entire file system in memory, allowing you to remove the CD and insert another if necessary.

Kernel parameters

A common reason for modifying the Grub commands is to pass boot-time parameters to the kernel. These parameters are simply appended to the end of the kernel line. You can get a list of the boot-time kernel parameters with the command man bootparam, or you can buy Greg Kroah-Hartman's little book Linux Kernel in a Nutshell (where else would you find a kernel? Read it for free at www.kroah.com/lkn/) which devotes an entire chapter to them.

If you have installed the kernel source code you will also find a list of kernel parameters in /usr/src/linux/Documentation/kernel-parameters.txt. Be aware that many parameters relate to optional features that may or may not be configured into the kernel. The list below gives just a few to give you the idea. Once Linux is up and running, you can examine the boot parameters that were passed to it by examining /proc/cmdline.

  • quiet: Suppresses kernel logging except for warnings and errors. Do not use this if you're trying to debug a boot-time problem.
  • splash: Displays a splash screen and progress bar, covering up the kernel messages. Again, remove this option if you're debugging.
  • single or S: Causes the kernel to run init to bring the system up into single-user run level. Depending on configuration, you may or may not be asked for the root password; then you'll be given a root shell. The X server and graphical desktop will not be started, nor will any partitions be mounted (except for the root partition of course). This is a useful mode for tackling X server startup problems or for repairing file systems.
  • vga=ask: This option allows you to select the mode for the video adaptor. The video mode sets the screen resolution (in character rows and columns) and you'll be prompted to select a video mode from a list. Occasionally useful if your monitor doesn't sync up to the default mode.
  • acpi=off: Disables ACPI (Advanced Configuration and Power Interface).
  • init=/bin/sh: Run the program /bin/sh (the shell) instead of init. This is even more extreme than doing a single-user boot and performs almost no user-level initialisation. Booting is extremely fast, and the shell will be the only user process running.
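
Since /proc/cmdline is just a line of space-separated words, checking whether a particular parameter was passed is straightforward. The following is a minimal sketch; has_boot_param is a hypothetical helper name of my own, and it matches whole words only, so quiet will not match a parameter that merely contains those letters.

```sh
#!/bin/sh
# Succeed if a boot parameter appears as a whole word in a kernel command line.
# has_boot_param is a hypothetical helper name.
has_boot_param() {
    # $1: parameter to look for, e.g. quiet or acpi=off
    # $2: command line text (defaults to the contents of /proc/cmdline)
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        [ "$word" = "$1" ] && return 0
    done
    return 1
}
```

For instance, has_boot_param quiet && echo "kernel messages are being suppressed" gives a quick reminder that the parameter needs removing before you start debugging.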

First published in Linux Format magazine

Your comments

Correction

I hope this isn't niggling, but there's a typo above:

/etc/rc.d/c.sysinit should be /etc/rc.d/rc.sysinit

This goes a long way toward explaining my problem

I've been having a dual-boot problem for a while. My MBR points to Grub for a Mandriva 2008 system. I've installed several other systems into other partitions with their boot loaders installed into their local partitions. For each of these, I've added a 'chain loader' entry to the original MDV2008 menu.lst.

Some of these work, and some do not. And I suspect it's because I have 2 drives, and some of the installers have mapped the drives so that /dev/sda is hd0 and /dev/sdb is hd1, and others have mapped them the other way.

I always thought that using the chain loader would just 'load Grub' from the chained partition. And by that, I mean 'load Grub from scratch'. But this article implies that it loads stage 1.5 from the MBR system and then tries to continue from there. If the drive mappings don't match it won't work. If there's a good reason for chain loading to work this way, I wish someone would explain it.

Anyway, I still don't understand all of it. Specifically:

One of my chained systems is Mandriva 2009. When I first installed it, it booted up fine with the chain loader. But after the installation, Grub only worked once. After that, I saw that it had created device.map 'backwards'. This must've happened as part of some 'post installation' script run on the first boot.

I was able to mount the MDV2009 system under MDV2008 and edit device.map and menu.lst to match the 2008 system, and after doing that, the 2009 system boots. But every time Mandriva sends out a kernel upgrade, it rejiggers the grub files back to the wrong state. Nowadays, I know to just edit the grub files after an MDV2009 kernel upgrade and save myself some headaches.

What I don't know is why the MDV2009 system insists on seeing the drives backwards. I recently noticed on the Mandriva boot loader config screen that there are radio buttons where you're supposed to tell it which drive the MBR is located on when you configure Grub to load to the partition instead of the MBR. I figured I must've forgotten to set that and that reinstalling the boot loader with the correct MBR setting would solve my problem. It didn't.

Any ideas?

Great article, thanks for sharing

Great article, many thanks for sharing.

Explains the Boot Process & GRUB very well !

Drive upgrade after installing Multiboot OS

I have successfully installed Solaris 10 dual boot with XP on a Dell D600 laptop years back.
The 40G Hitachi drive was partitioned as 12G for Solaris and the rest for XP.

Now, I moved a long way in life since then. There's no space left in the XP side for my photos and videos.

I am trying to upgrade my existing 40G hard drive to, say, a 160G hard drive.
How can I mirror the existing dual boot OS in the old drive exactly to the new drive?

In other words, I'd like to see the same startup with 2 options for me to boot into - XP & Solaris, but the XP side will have a looooooooooot of free space now.

booting problem in linux...

i have a dell studio laptop...
i had windows 7 installed in it , then i installed ubunto in it... i successfully installed but in my grub loader it is not showing my win 7... but wen i open linux.. it is there...
what shall i do...
plz do help!!!

booting problem

when i restart the system so after booting show the reapair the system .......please help me for this problem .

Something simple found here..

After being a bit more than novice Oracle DBA, and unix/linux guy, I've moved into the Oracle RAC arena in the last 5 days, but was never forced to really figure it if things went really willy-nilly, well they did this week. I created a logical volume using system-config-lvm from a firewire device, and then upon reboot, the system hung on LogVol03 from my device. Oh crap, but this article gave me a way out. I tried single-user mode, but couldn't modify the file until I used "mount -o remount,rw /". LVM's are a little different, so I've learned not to put them in /etc/fstab. Thanks Pearl

linux problem HELP!

Heres my problem. I used my linux notebook to transfer some files but i cant copy it because it change all the file into read-only. Then i shutdown it properly. When i tried to open it again then a windows appears an it says:

Your session only lasted less than 10 seconds. If you have not logged out yourself, this could mean that there is some installation problem or that you may be out of diskspace. Try logging in with one of the failsafe sessions to see if you can fix the problem.

View Details:
/etc/gdm/PreSession/Default: Registering you session with utmp

/etc/gdm/PreSession/Default: running: /usr/bin/sessreg -a -u /var/run/utmp -x "/var/gdm/:0.xservers" -h "" -l ":0" "qube"

Localuser: qube being added to access control list

** (gnome-session:1979): WARNING ** can not stat /tmp/orbit-qube

**ERROR**: Resource problem creating '/tmp/orbit-qube'
Aborting..

Please help me.

heres my email add: cayabyabdariusdick@yahoo.com

sigle kernal mode

how to edit kernal parameter into single mode for repairing of file system of my ipcop

pls also mail ur reply at:
usmanan.amin@gmail.com

thanks alot in advance

This is really a great article

This is awesome article published here...

This made us understand the complete boot process with ease.

This made my work simple.

Very well written!

Thank you for the good explanation.
Now things are clear.

Regards,

Error .... After Installation :(

sir ... i have installed redhat linux 5.1

my ram is 256mb

my processor is 2. GHZ above

========

i got error after installation i.e "network ssh error generating ssh2 dsa host key"

Solution plezz sir ..

my id is ...

sarde.tushar@hotmail.com

sir ... i have installed

sir ... i have installed redhat linux 5.1

bt prodlem is show now "kernel panic -not syncing:attempted to kill init !"

sir plz how to resolv a problem..........

my id is singh.rahul542@gmail.com

Help

my linux won't start...

mountall: fsck / [431] terminated with status 4
mountall: Filesystem has errors: /
init: mountall main process (423) terminated with status 3
mount of filesystem failed.

what can I do I don't have the cd anymore. email me at jb_westbound@hotmail.com

SUSE servers can't boot now after unexpected power off.blankscr.

SUSE servers can't boot now after unexpected power off.after splash screen goes blank.
In fail safe(grup) mode server gives this error
fsck failed. please repair manually and reboot. the root filesystem is currently mounted read-only. to remount read-write do :
bash# mount -n -o remount,rw /
is this problem solve through grup mode or failsafe mode.i m new in linux . Could anyone help me.......

Thanks in Advance.

brightness problem

i have installed linux on my machine successfully but when i boot it the visibility on my screen is zero but i can see traces of linux running in the background i.e its very difficult to see whats going on on the screen please help me to fix this problem>

GRUB2 please...

This article is very well written and can get someone on his feet when it comes to troubleshooting boot problems. I wish there is an article that has the same clarity as this one on GRUB2?

/Boot

i m using CentOS6.3 and deleted /boot plz let me know how can i recover all files and directory of /boot

I Can't use any of my Linux Install Disks!

I have a personal built PC with an AMD eight core processor, a 64GB SSD hard drive with Windows 7 Ultimate OS on it, and two other hard drives along with 16GB of memory.

I made an Agilia Linux install disk to use as a partition adjuster a long while back, but now when I load it like I have always done, it seems to freeze or at least not allow me to use the keyboard or mouse. I figured it might have something to do with my disk drive, but when I use my other two drives I get the same result. I got frustrated and decided to just go with another version of Linux (OpenSuse) to see if it still did the same thing, and of course I had the same problem. So, I tried a different version, in this case Knoppix, and again I could do nothing after it booted to the home screen. I am a newbie to Linux, but to not be able to use any one of the 4 different versions I have strikes me as an odd coincident, if you can call it that.

Is there anyone that has had the same problems as I have or at least is able to give me some ideas as to what is going on?

Re: Can't use Linux install disks

Silverdragon_1900: Make sure you are trying to load the linux distribution with the proper architecture. You mention you have an AMD processor, and I would bet that it is 64-bit. I don't know if the distributions still break it out between Intel and AMD processors, but you want to make sure you grab the correct bit depth for your processor.

one of the best boot descriptions Ive seen

TYVM. Been doing unix for 20 years, branching out into Linux. very comprehensive and easy-to-understand doc. Data like this,fleshed out with more troubleshooting tips (i.e. process/memory, SAN, network,etc) in a comprehensive pdf file, would be well worth buying.

grub loading error

grub loading
no such partions
entering rescue mode _

Hi boss this is the problem in my laptop.
can any one please help me
