Difference between revisions of "Debugging after a run crashes"

 
(One intermediate revision by the same user not shown)
Line 1: Line 1:
 
The best way to find the cause of a crash is to visualize the surface velocity with ACE/xmvis6. Usually you will see large or noisy velocities somewhere in the domain, which may give you hints about problems with the forcing, etc.
 
  
Sometimes you want to visualize the problem right before the crash. Here is a way to do this using the hotstart option.
+
Sometimes you want to visualize the problem right before the crash. You cannot use autocombine_MPI_elfe.pl because the last stack of output is incomplete, but you can use the core FORTRAN combine script (e.g., combine_output6) to combine an incomplete stack directly. Follow the instructions in the header of combine_output6.f90 to prepare the inputs and run it, then visualize the combined outputs with xmvis6.
 
 
Suppose your run crashed right after time step it=1005 (you can find this in mirror.out; note that "TIME STEP= " is written AFTER a step is completed), and the closest hotstart output (in outputs/) is at step 900.
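As a rough illustration of reading the last completed step from mirror.out, here is a minimal Python sketch. It assumes that each completed step is logged on a line containing "TIME STEP=" followed by an integer step number; the exact layout of that line may differ between code versions, and the function name last_completed_step is hypothetical.

```python
import re

# Assumed log format: "TIME STEP=" followed by optional spaces and an integer.
_STEP_RE = re.compile(r"TIME STEP=\s*(\d+)")

def last_completed_step(lines):
    """Return the last step number found in the given lines, or None."""
    last = None
    for line in lines:
        match = _STEP_RE.search(line)
        if match:
            # Later matches overwrite earlier ones, so the final value
            # is the last step that was reported as completed.
            last = int(match.group(1))
    return last
```

To use it, open mirror.out and pass the file object (or a list of its lines) to last_completed_step. Because the line is written after a step completes, the returned value is the last step that finished before the crash.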
 
 
 
First save any outputs that may be overwritten when restarting with ihot=2:
 
<UL>
 
  <LI>mv mirror.out mirror.out.0
 
  <LI>mv hotstart.in hotstart.in.0
 
  <LI>mv outputs outputs.0
 
  <LI>mkdir outputs
 
</UL>
 
....
 
 
 
The third move is necessary as we are going to change the stack size (ihfskip).
 
 
 
Combine the hotstart outputs at it=900 using combine_hotstart*.f90 to generate a new hotstart.in, and then move it to the same directory as hgrid.gr3.
 
Then set ihot=2 in param.in. Also set nspool, ihfskip, and hotout_write all to 1005. Start the run with the same number of CPUs. Occasionally the hotstarted run will crash at a different step, say 1006; if so, reset nspool, ihfskip, and hotout_write to 1006 and redo the run. The second attempt should work.
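The parameter changes above can be sketched as a small Python helper. This is only an illustration: it assumes param.in uses one "name = value" entry per line with optional trailing comments starting with "!", and the function name set_params is hypothetical. Parameters that are not already present in the text are left out rather than added.

```python
import re

# Assumed line layout: optional indent, name, "=", value, optional "! comment".
_PARAM_RE = re.compile(r"^(\s*)(\w+)(\s*=\s*)([^!]*?)(\s*(?:!.*)?)$")

def set_params(text, updates):
    """Return param.in text with the values of the named parameters replaced."""
    out = []
    for line in text.splitlines():
        match = _PARAM_RE.match(line)
        if match and match.group(2) in updates:
            indent, name, eq, _old, tail = match.groups()
            # Keep indentation and any trailing comment; swap only the value.
            line = f"{indent}{name}{eq}{updates[name]}{tail}"
        out.append(line)
    return "\n".join(out) + "\n"
```

For the example in the text, the call would be set_params(text, {"ihot": 2, "nspool": 1005, "ihfskip": 1005, "hotout_write": 1005}), where text is the content of param.in. Write the result to a new file and compare it with the original before starting the run.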
 
 
 
You'll see 2 stacks coming out after the crash. Combine the 1st stack and then visualize it.
 

Latest revision as of 09:43, 30 June 2014

The best way to find the cause of a crash is to visualize the surface velocity with ACE/xmvis6. Usually you will see large or noisy velocities somewhere in the domain, which may give you hints about problems with the forcing, etc.

Sometimes you want to visualize the problem right before the crash. You cannot use autocombine_MPI_elfe.pl because the last stack of output is incomplete, but you can use the core FORTRAN combine script (e.g., combine_output6) to combine an incomplete stack directly. Follow the instructions in the header of combine_output6.f90 to prepare the inputs and run it, then visualize the combined outputs with xmvis6.