Step-by-step actions to take when the Informatica repositories go down


The Informatica repositories went down at 6:10 AM EST. The admin console returned the error "Network Error (tcp_error)".

Actions Taken:

1)  Checked whether the admin console was accessible. It was not, and it returned the following error:
"Network Error (tcp_error) A communication error occurred: "Connection refused". The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time."
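
The same reachability check can also be run from a shell on another server, which gives a result that can be pasted into the ticket next to the screenshot. This is a minimal sketch, assuming curl is installed and the admin console URL is known. The function name check_console and the status words it prints are illustrative only and are not part of Informatica.

```shell
# check_console URL
# Prints one status word describing whether URL answered.
# The mapping uses curl's documented exit codes.
check_console() {
    local url=$1
    # --fail makes curl exit with 22 when the server answers with HTTP 400 or above.
    curl --silent --fail --output /dev/null --max-time 10 "$url"
    local rc=$?
    case $rc in
        0)  echo "REACHABLE" ;;
        6)  echo "DNS_FAILED" ;;       # host name could not be resolved
        7)  echo "CONNECT_FAILED" ;;   # includes "Connection refused"
        22) echo "HTTP_ERROR" ;;       # something answered, but with an error status
        28) echo "TIMEOUT" ;;
        *)  echo "OTHER_ERROR ($rc)" ;;
    esac
}
```

Call it as check_console followed by the admin console URL for the environment. If the request passes through a proxy that serves its own error page, the result would typically be HTTP_ERROR rather than CONNECT_FAILED.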

2)  Took screenshots of all the loads in both repositories.

3)  Pinged the Informatica server from another server; it responded as alive.

[/apps/d001/staging/Prod]>ping is alive

4)  At 6:20 AM EST the server was accessible again. Checked the processes running on the server at that time; no processes were running.

5)  Checked the repository database GWSCD with tnsping. The database was down and tnsping returned the error TNS-12541: TNS:no listener.
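
The tnsping result can be turned into a one-word status so that it is easy to record or to use in a monitoring script. The sketch below is a hedged example: the function name classify_tnsping and the status words are made up for illustration, and only the two TNS error codes and the "OK (n msec)" success line are taken from real tnsping behaviour.

```shell
# classify_tnsping: read tnsping output on stdin, print a one-word status.
#   LISTENER_DOWN  TNS-12541 (no listener), the error seen in this incident
#   UNKNOWN_ALIAS  TNS-03505 (the alias could not be resolved)
#   OK             tnsping reported "OK (<n> msec)"
#   OTHER          anything else; read the raw output by hand
classify_tnsping() {
    local output
    output=$(cat)
    case "$output" in
        *TNS-12541*) echo "LISTENER_DOWN" ;;
        *TNS-03505*) echo "UNKNOWN_ALIAS" ;;
        *"OK ("*)    echo "OK" ;;
        *)           echo "OTHER" ;;
    esac
}
```

Typical use would be: tnsping GWSCD 2>&1 | classify_tnsping. An OK result only shows that the listener answered, so a sqlplus login is still needed to confirm that the database itself is open.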

6)  Contacted the DB team, who brought the database back up.

7)  Checked the uptime of the server; it showed that the server had been up for only 1:10 (1 hour 10 minutes).

The timings below show that the physical server was rebooted.
7:33am up 1:10, 4 users, load average: 0.30, 0.38, 0.36

7:37am up 1:14, 4 users, load average: 0.17, 0.29, 0.33

Sun Apr 3 07:41:31 EDT 2011
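
The boot time can be estimated from an uptime line by subtracting the "up" duration from the clock time at the start of the line. The sketch below does that arithmetic with awk. The function name boot_time_from_uptime is illustrative, and it only handles the "H:MMam up H:MM," form shown in the samples above, not forms such as "up 3 days" or "up 12 min".

```shell
# boot_time_from_uptime LINE
# LINE is one line of uptime output in the "H:MMam up H:MM, ..." form.
# Prints the estimated boot time as 24-hour HH:MM, or returns non-zero
# if the line is not in that form.
boot_time_from_uptime() {
    printf '%s\n' "$1" | awk '
        $2 == "up" && $1 ~ /^[0-9]+:[0-9][0-9](am|pm)$/ && $3 ~ /^[0-9]+:[0-9][0-9],$/ {
            ampm = substr($1, length($1) - 1)               # last two characters
            split(substr($1, 1, length($1) - 2), now, ":")  # clock time without am/pm
            split($3, up, ":")                              # up[2] still ends with a comma
            h = now[1] % 12                                 # 12am and 12pm both become 0 here
            if (ampm == "pm") h += 12
            # Minutes since midnight; adding 1440 handles a boot before midnight.
            boot = (h * 60 + now[2] - up[1] * 60 - (up[2] + 0) + 1440) % 1440
            printf "%02d:%02d\n", int(boot / 60), boot % 60
            found = 1
        }
        END { exit found ? 0 : 1 }'
}
```

Both uptime samples above give the same estimate, 06:23 by the server's own clock. The date output reports EDT while the incident times in this note are given in EST, so the time zones should be reconciled before the estimate is placed on the incident timeline.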

8)  A ticket should be raised with the server team asking why the server was rebooted without prior notice.

Actions to be taken:

Follow up with the Unix team to obtain an RCA, or to check for any hardware errors.
The following points need to be covered during the analysis.

1)  Processes running on the server.
2)  tnsping and sqlplus for the database.
3)  Uptime of the server.
4)  Screenshot of the admin console.
5)  Major loads impacted must be noted down so that they can be recovered once the issue is fixed.
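
The command-line checks in the list above can be gathered in one pass so that the evidence is captured with a timestamp before anything is restarted. This is a sketch under stated assumptions: the function name collect_evidence, the log file name, and the idea of listing processes by OS user are all illustrative choices. The screenshots, the sqlplus login, and the list of impacted loads still have to be handled by hand.

```shell
# collect_evidence DB_ALIAS [OUT_DIR] [OS_USER]
# Saves the raw output of the checks into one timestamped file and prints
# the file name. OS_USER is the account the Informatica services run
# under (an assumption; it defaults to the current user).
collect_evidence() {
    local db_alias=$1
    local out_dir=${2:-.}
    local os_user=${3:-$(id -un)}
    local out_file="$out_dir/outage_$(date +%Y%m%d_%H%M%S).log"
    {
        echo "== date ==";   date
        echo "== uptime =="; uptime 2>&1 || true
        echo "== processes for $os_user =="
        ps -fu "$os_user" 2>&1 || true
        echo "== tnsping $db_alias =="
        tnsping "$db_alias" 2>&1 || true
    } > "$out_file"
    echo "$out_file"
}
```

For this incident the call would be collect_evidence GWSCD, optionally followed by a directory to write the log to. A command that is missing or fails is recorded in the log instead of stopping the collection.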

