Installing open source grid engine 6.2 (both u4 and u5): qmaster does not start

Hi
I've been trying all day to install SGE 6.2 (I've tried both u4 and u5) on my 16-node cluster; actually, I can't even get the master node installed! I am using the gui_installer, but the problem is easy to reproduce outside the installer too.

It basically consists of two failures:

1) The daemon (launched by hand, e.g. sgeroot/bin/arch/sge_qmaster, or by /etc/init.d/sgemaster.p6444) dies after half a minute (I can see it among the running processes, e.g. in "top", but after a while it disappears).

2) qping does not find the daemon, even during the first 30 seconds while it is alive:

> /usr/local/sge6_2u5/bin/lx24-amd64/qping -info clu.01 6444 qmaster 1

returns

endpoint clu.01/qmaster/1 at port 6444: got send error
got select error: Connection refused
got select error: closing "clu.01/qmaster/1"

(the host is of course "clu.01")

Any idea how I can proceed to understand this issue? Why does qping not connect (the ports ARE open)? Why does the daemon die after surviving for a while?

thanks a lot

Andrea

Diagnosing a Failing QMaster

There are a couple of different issues that can cause the master to fail to start.  The port might already be in use.  The file system might be full.  The data spool might be corrupted.  Etc., etc., etc.
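
The first two of those causes are quick to rule out. Here is a minimal Python sketch, not part of Grid Engine: the function names are made up for illustration, and 6444 is simply the qmaster port used in this thread. It tries to bind the port (a failure usually means something else already holds it) and reports free space on a file system.

```python
import shutil
import socket


def port_in_use(port, host=""):
    """Return True if binding a TCP socket to (host, port) fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Usually "address already in use"; a permission error also lands here.
            return True
    return False


def free_megabytes(path):
    """Return the free space, in whole megabytes, on the file system holding path."""
    return shutil.disk_usage(path).free // (1024 * 1024)


if __name__ == "__main__":
    # 6444 is the qmaster port from this thread; "/" is only an example path.
    print("port 6444 in use:", port_in_use(6444))
    print("free MB on /:", free_megabytes("/"))
```

Run it while the qmaster is stopped: if the port is reported as in use then, some other process has taken it.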

The first place to look is in the "messages" file in the "qmaster" directory in the master spool directory.  If something's going wrong it should show up there.
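
If you are not sure where that file lives, the sketch below guesses the path and prints the last few lines. The layout $SGE_ROOT/$SGE_CELL/spool/qmaster/messages, the cell name "default" and the fallback root /usr/local/sge6_2u5 are assumptions about a default install; adjust them if your spool directory is somewhere else.

```python
import os
from collections import deque


def default_messages_path(environ=os.environ):
    """Guess the qmaster messages path from SGE_ROOT and SGE_CELL (an assumption)."""
    root = environ.get("SGE_ROOT", "/usr/local/sge6_2u5")
    cell = environ.get("SGE_CELL", "default")
    return os.path.join(root, cell, "spool", "qmaster", "messages")


def tail(path, count=20):
    """Return the last `count` lines of the file at `path`."""
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    path = default_messages_path()
    if os.path.exists(path):
        print("\n".join(tail(path)))
    else:
        print("no messages file at", path)
```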

Personally, I always just start with the bigger hammer.  I turn on debug output and start the qmaster by hand.  See http://blogs.sun.com/templedf/entry/using_debugging_output  If you know how to read the debug output, it'll tell you exactly what's wrong.
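
As far as I remember, the trick in that post comes down to two environment variables: SGE_DEBUG_LEVEL (eight numbers, one per debug layer) and SGE_ND (stops the daemon from going into the background). Treat both names and the default value below as assumptions to check against the linked post. The helper functions are hypothetical, just a sketch of how to launch the binary by hand with that environment.

```python
import os
import subprocess


def debug_environment(level="2 0 0 0 0 0 0 0", base=None):
    """Return a copy of the environment with the debug variables added."""
    env = dict(os.environ if base is None else base)
    env["SGE_DEBUG_LEVEL"] = level  # assumed format: eight per-layer values
    env["SGE_ND"] = "true"          # assumed: keeps the daemon in the foreground
    return env


def run_qmaster_in_foreground(binary):
    """Start the given sge_qmaster binary with debug output; blocks until it exits."""
    return subprocess.call([binary], env=debug_environment())


if __name__ == "__main__":
    # Only show the variables here; call run_qmaster_in_foreground() with the
    # full path to your own sge_qmaster binary to actually start it.
    env = debug_environment(base={})
    print("SGE_DEBUG_LEVEL=%r SGE_ND=%r" % (env["SGE_DEBUG_LEVEL"], env["SGE_ND"]))
```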

a problem of hostnames...

Thanks a lot for the response, templedf; it opened up a whole new world for me (it's exactly what I was looking for).

With debug level 1 I get the first clue (this is the output when sge_qmaster is run):

--------------------------------------------------------------------------------------------------------------------
0 8415 140558720419568 **** starting localization procedure ... ********
1 8415 140558720419568 could not get environment variable "GRIDPACKAGE"
2 8415 140558720419568 could not get environment variable "GRIDLOCALEDIR"
3 8415 140558720419568 setlocale() returns "en_US.UTF-8"
4 8415 140558720419568 cutting of language string after "_":
5 8415 140558720419568 locale directory: >/usr/local/sge6_2u5/locale<
6 8415 140558720419568 package file: >lx24-amd64/gridengine.mo<
7 8415 140558720419568 language (LANG): >en<
8 8415 140558720419568 loading message file: /usr/local/sge6_2u5/locale/en/LC_MESSAGES/lx24-amd64/gridengine.mo
9 8415 140558720419568 could not open message file - error
10 8415 140558720419568 setlocale() returns "en_US.UTF-8"
11 8415 140558720419568 bindtextdomain() returns "/usr/local/sge6_2u5/locale"
12 8415 140558720419568 textdomain() returns "lx24-amd64/gridengine"
13 8415 140558720419568 error id output : disabled
14 8415 140558720419568 **** starting localization procedure ... failed **
15 8415 140558720419568 sge_qmaster is not daemonized
16 8415 140558720419568 returning port value: 6445
17 8415 main error: reresolve hostname failed: can't resolve host name
18 8415 main error: reresolve hostname failed: can't resolve host name
(etc... it goes on with identical lines until I kill it or it dies alone, probably bored to death)
--------------------------------------------------------------------------------------------------------------------

What I cannot understand is the reason for this hostname failure. In fact:
> utilbin/lx24-amd64/gethostname
Hostname: clu.01
Aliases:
Host Address(es): 192.168.0.1

which is correct!

Is there a real need for aliases? I don't think so. Is there a problem with a hostname containing dots?

I am quite ignorant about how name resolution works. I have not installed any DNS server (e.g. bind). A couple of years ago I installed SGE on a small cluster (3 nodes) with a similar configuration and it went smoothly, with no DNS installed, etc.
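
One generic way to test resolution independently of Grid Engine is a round trip: name to address, then address back to names. The Python sketch below does that with the standard socket calls. It does not reproduce what the gethostname utility or the qmaster do internally, and the function names are made up for illustration.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name to an address, then the address back to its names."""
    result = {"name": name, "address": None, "reverse_names": [], "error": None}
    try:
        result["address"] = forward(name)
        primary, aliases, _addresses = reverse(result["address"])
        result["reverse_names"] = [primary] + list(aliases)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result["error"] = str(exc)
    return result


def round_trips(result):
    """True when the reverse lookup returns the name we started from."""
    return result["error"] is None and result["name"] in result["reverse_names"]


if __name__ == "__main__":
    report = check_host(socket.gethostname())
    print(report, "round trip ok:", round_trips(report))
```

The two lookup functions are passed in as parameters only so the check can be exercised without touching the real resolver.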

Btw: I can ignore the first messages about missing variables, can't I?

I've also tried debug level 2, which is much more verbose; I post it below.

Thanks for any further help!

Andrea

---------------------------------------------------------------- debug level 2 ----------------------------

0 8485 140455078328048 --> qmaster() {
1 8485 140455078328048 --> sge_get_root_dir() {
2 8485 140455078328048 <-- sge_get_root_dir() ../libs/uti/sge_arch.c 136 }
3 8485 140455078328048 --> sge_init_language_func() {
4 8485 140455078328048 <-- sge_init_language_func() ../libs/uti/sge_language.c 455 }
5 8485 140455078328048 --> sge_init_language() {
6 8485 140455078328048 ****** starting localization procedure ... **********
7 8485 140455078328048 could not get environment variable "GRIDPACKAGE"
8 8485 140455078328048 could not get environment variable "GRIDLOCALEDIR"
9 8485 140455078328048 --> sge_get_root_dir() {
10 8485 140455078328048 <-- sge_get_root_dir() ../libs/uti/sge_arch.c 136 }
11 8485 140455078328048 setlocale() returns "en_US.UTF-8"
12 8485 140455078328048 cutting of language string after "_":
13 8485 140455078328048 locale directory: >/usr/local/sge6_2u5/locale<
14 8485 140455078328048 package file: >lx24-amd64/gridengine.mo<
15 8485 140455078328048 language (LANG): >en<
16 8485 140455078328048 loading message file: /usr/local/sge6_2u5/locale/en/LC_MESSAGES/lx24-amd64/gridengine.mo
17 8485 140455078328048 could not open message file - error
18 8485 140455078328048 setlocale() returns "en_US.UTF-8"
19 8485 140455078328048 bindtextdomain() returns "/usr/local/sge6_2u5/locale"
20 8485 140455078328048 textdomain() returns "lx24-amd64/gridengine"
21 8485 140455078328048 error id output : disabled
22 8485 140455078328048 ****** starting localization procedure ... failed **
23 8485 140455078328048 <-- sge_init_language() ../libs/uti/sge_language.c 381 }
24 8485 140455078328048 --> sge_daemonize_qmaster() {
25 8485 140455078328048 sge_qmaster is not daemonized
26 8485 140455078328048 <-- sge_daemonize_qmaster() ../daemons/qmaster/sge_qmaster_threads.c 188 }
27 8485 140455078328048 --> sge_qmaster_thread_init() {
28 8485 140455078328048 --> sge_setup2() {
29 8485 140455078328048 --> sge_get_execd_port() {
30 8485 140455078328048 returning port value: 6445
31 8485 140455078328048 <-- sge_get_execd_port() ../libs/uti/sge_hostname.c 286 }
32 8485 140455078328048 --> sge_gdi_ctx_class_create() {
33 8485 140455078328048 --> sge_gdi_ctx_setup() {
34 8485 140455078328048 --> sge_env_state_class_create() {
35 8485 140455078328048 --> sge_env_state_setup() {
36 8485 140455078328048 <-- sge_env_state_setup() ../libs/uti/sge_env.c 157 }
37 8485 140455078328048 <-- sge_env_state_class_create() ../libs/uti/sge_env.c 126 }
38 8485 140455078328048 --> sge_prog_state_class_create() {
39 8485 140455078328048 --> sge_prog_state_setup() {
40 8485 140455078328048 <-- sge_prog_state_setup() ../libs/uti/sge_prog.c 895 }
41 8485 140455078328048 <-- sge_prog_state_class_create() ../libs/uti/sge_prog.c 808 }
42 8485 140455078328048 --> sge_path_state_class_create() {
43 8485 140455078328048 --> sge_path_state_setup() {
44 8485 140455078328048 <-- sge_path_state_setup() ../libs/uti/setup_path.c 692 }
45 8485 140455078328048 <-- sge_path_state_class_create() ../libs/uti/setup_path.c 585 }
46 8485 140455078328048 --> sge_bootstrap_state_class_create() {
47 8485 140455078328048 --> sge_bootstrap_state_class_init() {
48 8485 140455078328048 <-- sge_bootstrap_state_class_init() ../libs/uti/sge_bootstrap.c 715 }
49 8485 140455078328048 --> sge_bootstrap_state_setup() {
50 8485 140455078328048 --> sge_get_confval_array() {
51 8485 140455078328048 <-- sge_get_confval_array() ../libs/uti/sge_spool.c 660 }
52 8485 140455078328048 <-- sge_bootstrap_state_setup() ../libs/uti/sge_bootstrap.c 866 }
53 8485 140455078328048 <-- sge_bootstrap_state_class_create() ../libs/uti/sge_bootstrap.c 663 }
54 8485 140455078328048 --> feature_initialize_from_string() {
55 8485 140455078328048 --> feature_get_featureset_id() {
56 8485 140455078328048 <-- feature_get_featureset_id() ../libs/sgeobj/sge_feature.c 413 }
57 8485 140455078328048 --> feature_activate() {
58 8485 140455078328048 <-- feature_activate() ../libs/sgeobj/sge_feature.c 300 }
59 8485 140455078328048 <-- feature_initialize_from_string() ../libs/sgeobj/sge_feature.c 200 }
60 8485 140455078328048 --> sge_csp_path_class_create() {
61 8485 140455078328048 --> sge_csp_path_setup() {
62 8485 140455078328048 sge_csp_path_setup:../libs/uti/sge_csp_path.c:316
63 8485 140455078328048 <-- sge_csp_path_setup() ../libs/uti/sge_csp_path.c 447 }
64 8485 140455078328048 <-- sge_csp_path_class_create() ../libs/uti/sge_csp_path.c 265 }
65 8485 140455078328048 <-- sge_gdi_ctx_setup() ../libs/gdi/sge_gdi_ctx.c 725 }
66 8485 140455078328048 <-- sge_gdi_ctx_class_create() ../libs/gdi/sge_gdi_ctx.c 461 }
67 8485 140455078328048 --> sge_gdi_set_thread_local_ctx() {
68 8485 140455078328048 --> sge_bootstrap_state_set_thread_local() {
69 8485 140455078328048 --> sge_bootstrap_state_class_init() {
70 8485 140455078328048 <-- sge_bootstrap_state_class_init() ../libs/uti/sge_bootstrap.c 715 }
71 8485 140455078328048 <-- sge_bootstrap_state_set_thread_local() ../libs/uti/sge_bootstrap.c 159 }
72 8485 140455078328048 <-- sge_gdi_set_thread_local_ctx() ../libs/gdi/sge_gdi_ctx.c 247 }
73 8485 140455078328048 <-- sge_setup2() ../libs/gdi/sge_gdi_ctx.c 1954 }
74 8485 140455078328048 --> gdi2_reresolve_qualified_hostname() {
75 8485 140455078328048 <-- gdi2_reresolve_qualified_hostname() ../libs/gdi/sge_gdi_ctx.c 2001 }
76 8485 140455078328048 <-- sge_qmaster_thread_init() ../daemons/qmaster/setup_qmaster.c 258 }
77 8485 140455078328048 --> sge_gdi_ctx_class_prepare_enroll() {
78 8485 main --> gdi2_reresolve_qualified_hostname() {
79 8485 main <-- gdi2_reresolve_qualified_hostname() ../libs/gdi/sge_gdi_ctx.c 2001 }
80 8485 main --> sge_gdi_ctx_class_error() {
81 8485 main --> sge_error_verror() {
82 8485 main error: reresolve hostname failed: can't resolve host name
83 8485 main <-- sge_error_verror() ../libs/uti/sge_error_class.c 264 }
84 8485 main <-- sge_gdi_ctx_class_error() ../libs/gdi/sge_gdi_ctx.c 517 }
85 8485 main <-- sge_gdi_ctx_class_prepare_enroll() ../libs/gdi/sge_gdi_ctx.c 965 }
86 8485 main --> sge_gdi_ctx_class_prepare_enroll() {
87 8485 main --> gdi2_reresolve_qualified_hostname() {
88 8485 main <-- gdi2_reresolve_qualified_hostname() ../libs/gdi/sge_gdi_ctx.c 2001 }
89 8485 main --> sge_gdi_ctx_class_error() {
90 8485 main --> sge_error_verror() {
91 8485 main error: reresolve hostname failed: can't resolve host name
92 8485 main <-- sge_error_verror() ../libs/uti/sge_error_class.c 264 }
93 8485 main <-- sge_gdi_ctx_class_error() ../libs/gdi/sge_gdi_ctx.c 517 }
94 8485 main <-- sge_gdi_ctx_class_prepare_enroll() ../libs/gdi/sge_gdi_ctx.c 965 }
95 8485 main --> sge_gdi_ctx_class_prepare_enroll() {
96 8485 main --> gdi2_reresolve_qualified_hostname() {
97 8485 main <-- gdi2_reresolve_qualified_hostname() ../libs/gdi/sge_gdi_ctx.c 2001 }
98 8485 main --> sge_gdi_ctx_class_error() {
99 8485 main --> sge_error_verror() {
100 8485 main error: reresolve hostname failed: can't resolve host name
101 8485 main <-- sge_error_verror() ../libs/uti/sge_error_class.c 264 }
102 8485 main <-- sge_gdi_ctx_class_error() ../libs/gdi/sge_gdi_ctx.c 517 }
103 8485 main <-- sge_gdi_ctx_class_prepare_enroll() ../libs/gdi/sge_gdi_ctx.c 965 }
104 8485 main --> sge_gdi_ctx_class_prepare_enroll() {
105 8485 main --> gdi2_reresolve_qualified_hostname() {
(etc.)

toward a solution: the "dot" problem

After a couple more hours of thinking, I've read up a bit on hostnames on Linux; it is quite a mess. But the biggest problem is that *I* messed it up even more. The hostnames I've set up on this cluster are quite crazy: clu.01, clu.02, etc.

They are stupid for two reasons: 1) they contain a dot where none is needed (I don't need domains; it is just a small LAN with no outside connection); 2) even with a domain, they are wrong: the domain ("clu") should come after the unqualified node name (01, 02, etc.). As it is, a resolver sees 16 different domains, each with a single node named "clu".
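
To make the point concrete, here is a small Python sketch that splits each name at the first dot, the way a short name and a domain are normally read. Whether Grid Engine does exactly this internally is my assumption, not something confirmed here; the helper names are made up.

```python
from collections import defaultdict


def split_hostname(name):
    """Split a name at the first dot into (short name, domain)."""
    short, _dot, domain = name.partition(".")
    return short, domain


def group_by_short_name(names):
    """Map each short name to the full names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[split_hostname(name)[0]].append(name)
    return dict(groups)


if __name__ == "__main__":
    nodes = ["clu.%02d" % i for i in range(1, 17)]  # clu.01 ... clu.16
    print(split_hostname("clu.01"))
    print(len(group_by_short_name(nodes)["clu"]), "nodes share the short name 'clu'")
```

With this naming, all 16 nodes end up under the single short name "clu", so the short name cannot tell them apart.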

This is the origin of the failure on qmaster startup. If the /etc/hosts file contains a line like this:
192.168.0.1 clu.01 clu
then the qmaster daemon starts smoothly and stays alive, qping sees it, etc. etc.!
If the last "clu" (just an alias) is removed, everything is broken as before.

It is likely that the qmaster daemon sees "clu" as name of the node and looks for it during startup. The alias in /etc/hosts makes life easy; without the alias, its life is impossible.
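
The difference between the two variants of the hosts line can be shown with a tiny parser. This is only a sketch of how a hosts file maps names to addresses, not Grid Engine code, and the function name is made up.

```python
def parse_hosts(text):
    """Map every name and alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, address)  # keep the first line that mentions a name
    return table


WITH_ALIAS = "192.168.0.1 clu.01 clu\n"
WITHOUT_ALIAS = "192.168.0.1 clu.01\n"

if __name__ == "__main__":
    for label, text in (("with alias", WITH_ALIAS), ("without alias", WITHOUT_ALIAS)):
        table = parse_hosts(text)
        print(label, "-> clu resolves to", table.get("clu"))
```

With the alias, a lookup of "clu" finds 192.168.0.1; without it, there is no entry for "clu" at all, which fits the "can't resolve host name" error.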

Anyway, this alias solution is not a good one for me: after succeeding in installing SGE on the first node, I have to install it on all the other nodes (at least as execution nodes). Obviously I cannot use "clu" as an alias on all the nodes. I am trying various solutions, but at the moment nothing works: the gui_installer stops after the first 3 tasks, i.e. it fails all installations on the other 15 nodes because of the error "cant resolve hostname". Sigh.

I think I have to change the name of all nodes, which makes me anxious.
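
If it comes to renaming, the new /etc/hosts entries can at least be generated instead of typed by hand. In the sketch below, the prefix "node" and the address range 192.168.0.1 to 192.168.0.16 are purely hypothetical: only 192.168.0.1 (for clu.01) is actually known. Renaming a node also involves more than /etc/hosts, which this does not cover.

```python
import ipaddress


def hosts_entries(first_address="192.168.0.1", count=16, prefix="node"):
    """Build hosts-file lines with dot-free names: node01, node02, ..."""
    base = ipaddress.IPv4Address(first_address)
    lines = []
    for index in range(count):
        # Consecutive addresses are an assumption about how the LAN is numbered.
        lines.append("%s\t%s%02d" % (base + index, prefix, index + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(hosts_entries()))
```

The point of the dot-free names is that every node keeps a distinct short name.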

Thanks again for the help

Andrea