Description
vb_basic test in release mode.
This is for a 2048³ grid on 64 nodes × 2 CPUs × 4 cores, i.e. 512 MPI processes.
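Right now, transposition is dominant: "FFTW Transposition only" takes 33:38 of the 57:22 spent in the time-step loop (about 59%), while the FFTs themselves ("FFT only") take 15:45 (about 27%). We need to see whether multi-threading would help by reducing the number of messages (while increasing their size). Multi-threading can also easily be applied to the host computation hot spots (mostly the FFTs).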
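A rough count of what the switch to threads would change for the transposition, as a minimal bash sketch. It assumes an all-to-all exchange in which every rank sends one equal block to each other rank, and a full field of complex doubles (16 bytes per point). Neither assumption comes from the cubby sources, and `transpose_msgs` is a hypothetical helper. The current run uses 512 ranks (64 nodes, `-npernode 8`); the hybrid alternative would use 64 ranks with 8 threads each.

```bash
#!/bin/bash
# Hypothetical helper (not part of cubby): message count and per-message size
# for one global transposition of an N^3 field over P ranks.
# ASSUMPTIONS: an all-to-all exchange in which every rank sends one equal block
# to each of the other P-1 ranks, and BYTES bytes per grid point
# (default 16, i.e. one complex double).
transpose_msgs() {
  local N=$1 P=$2 BYTES=${3:-16}
  local total=$(( N * N * N * BYTES ))  # bytes in the whole field
  local block=$(( total / (P * P) ))    # bytes per (sender, receiver) pair
  local per_rank=$(( P - 1 ))           # messages sent by one rank
  local job=$(( P * (P - 1) ))          # messages sent by the whole job
  echo "P=$P msgs/rank=$per_rank msgs/job=$job bytes/msg=$block"
}

# Current run: 64 nodes x 8 single-threaded ranks per node.
transpose_msgs 2048 512   # P=512 msgs/rank=511 msgs/job=261632 bytes/msg=524288
# Hybrid alternative: 64 nodes x 1 rank per node with 8 threads each.
transpose_msgs 2048 64    # P=64 msgs/rank=63 msgs/job=4032 bytes/msg=33554432
```

Under these assumptions the hybrid layout sends about 65 times fewer messages per transposition, each 64 times larger (32 MiB instead of 512 KiB). Whether that pays off on this machine is exactly what the test has to measure.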
Submission script
#!/bin/bash
#OAR -n vb2048-p2p
#OAR -l /nodes=64/cpu=2,walltime=20:00:00
#OAR -p gpu='NO'
ulimit -c unlimited
. /softs/openmpi-1.4.3-intel-11/env.sh
. /softs/boost_1_47_0-intel-11/env.sh
. /softs/fftw-3.3-intel-11/env.sh
echo $LD_LIBRARY_PATH
/softs/openmpi-1.4.3-intel-11/bin/mpiexec --mca mpi_yield_when_idle 1 --mca mpi_leave_pinned 0 --mca mpi_preconnect_mpi 1 \
    -x LD_LIBRARY_PATH -npernode 8 -machinefile $OAR_FILE_NODES \
    ../../../cubby --fft-verbose --cube-dim=2048 --fftw-planner=measure --fft-eff=speed --transposition=p2p
cubby.data
magnetic
Nts 100
out_energy 10
v_check 1
velocity_samples_nb 10
The performance trace
512 Processor run, global<2048,2048,2048> for 100 time steps
main timer : (real:1:1:46,user:0:43:55,sys:0:17:50)[clk:370607]
loop : (real:0:57:22,user:0:40:51,sys:0:16:29)[clk:344217] (100 calls X 3.442170e+03)

FFT timer : (real:0:50:11,user:0:33:14,sys:0:16:56)[clk:301102]
planification time : (real:0:0:9,user:0:0:8,sys:0:0:1)[clk:955]
FFT and transposition time : (real:0:50:1,user:0:33:5,sys:0:16:55)[clk:300147] (557 calls X 5.388635e+02)
FFT only : (real:0:15:45,user:0:15:45,sys:0:0:0)[clk:94588] (4773 calls X 1.981731e+01)
FFTW Transposition only : (real:0:33:38,user:0:16:42,sys:0:16:55)[clk:201819] (2108 calls X 9.573956e+01)

azur::array timer root : (real:0:6:35,user:0:6:13,sys:0:0:21)[clk:39523]
view = expr : (real:0:6:35,user:0:6:13,sys:0:0:21)[clk:39523] (1680 calls X 2.352559e+01)
fftw3<dbl> id= dbl : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:11] (2 calls X 5.500000e+00)
fftw4<c<dbl>> /= long int : (real:0:0:35,user:0:0:35,sys:0:0:0)[clk:3527] (208 calls X 1.695673e+01)
fftw4<dbl> id= dbl : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:17] (1 calls X 1.700000e+01)
fftw3<dbl> id= fftw3<dbl> : (real:0:1:19,user:0:1:16,sys:0:0:2)[clk:7925] (654 calls X 1.211774e+01)
fftw4<dbl> *= dbl : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:15] (1 calls X 1.500000e+01)
fftw4<c<dbl>> id= fftw4<c<dbl>> : (real:0:0:53,user:0:0:53,sys:0:0:0)[clk:5351] (202 calls X 2.649010e+01)
fftw4<dbl> id= fftw4<dbl> : (real:0:0:32,user:0:0:13,sys:0:0:18)[clk:3200] (107 calls X 2.990654e+01)
fftw4<c<dbl>> id= (s2v<basic3<dbl>> * ((fftw4<c<dbl>> * dbl) + fftw4<c<dbl>>)) : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:52] (1 calls X 5.200000e+01)
fftw4<c<dbl>> id= (s2v<basic3<dbl>> * ((fftw4<c<dbl>> * dbl) + fftw4<c<dbl>>)) : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:49] (1 calls X 4.900000e+01)
fftw4<c<dbl>> id= (s2v<basic3<dbl>> * (fftw4<c<dbl>> * dbl)) : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:34] (1 calls X 3.400000e+01)
fftw4<c<dbl>> *= s2v<basic3<dbl>> : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:45] (1 calls X 4.500000e+01)
fftw4<c<dbl>> += fftw4<c<dbl>> : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:33] (1 calls X 3.300000e+01)
fftw4<c<dbl>> id= ((fftw4<c<dbl>> * s2v<basic3<dbl>>) + (s2v<basic3<dbl>> * (fftw4<c<dbl>> * dbl))) : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:65] (1 calls X 6.500000e+01)
fftw3<c<dbl>> id= (fftw3<c<dbl>> swp(+) (fftw3<c<dbl>> + fftw3<c<dbl>>)) : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:58] (4 calls X 1.450000e+01)
fftw4<c<dbl>> id= ((fftw4<c<dbl>> * dbl) * s2v<basic3<dbl>>) : (real:0:0:34,user:0:0:35,sys:0:0:0)[clk:3491] (99 calls X 3.526263e+01)
fftw4<c<dbl>> -= fftw4<c<dbl>> : (real:0:0:29,user:0:0:29,sys:0:0:0)[clk:2916] (99 calls X 2.945455e+01)
fftw4<c<dbl>> -= ((fftw4<c<dbl>> * dbl) * s2v<basic3<dbl>>) : (real:0:0:42,user:0:0:42,sys:0:0:0)[clk:4286] (99 calls X 4.329293e+01)
fftw4<c<dbl>> id= (s2v<basic3<dbl>> swp(*) (fftw4<c<dbl>> + (fftw4<c<dbl>> * dbl))) : (real:0:1:24,user:0:1:24,sys:0:0:0)[clk:8446] (198 calls X 4.265657e+01)

cubby::field timer root : (real:0:36:46,user:0:19:50,sys:0:16:56)[clk:220685]
scalar::transpose_blocks_when_received : (real:0:33:35,user:0:16:42,sys:0:16:53)[clk:201567] (3222 calls X 6.255959e+01)
scalar::copy_transposed : (real:0:1:45,user:0:1:45,sys:0:0:0)[clk:10549] (814592 calls X 1.295004e-02)
vector::in_place_curl : (real:0:0:31,user:0:0:31,sys:0:0:0)[clk:3135] (202 calls X 1.551980e+01)
scalar::local_energy : (real:0:0:2,user:0:0:2,sys:0:0:0)[clk:271] (66 calls X 4.106061e+00)
vector::vec_prod : (real:0:0:51,user:0:0:51,sys:0:0:0)[clk:5156] (202 calls X 2.552475e+01)
vector::project : (real:0:0:38,user:0:0:38,sys:0:0:0)[clk:3826] (101 calls X 3.788119e+01)
scalar::dealias : (real:0:0:58,user:0:0:58,sys:0:0:0)[clk:5892] (606 calls X 9.722773e+00)
vector::div : (real:0:0:2,user:0:0:1,sys:0:0:0)[clk:252] (4 calls X 6.300000e+01)
scalar::dx : (real:0:0:2,user:0:0:1,sys:0:0:0)[clk:214] (16 calls X 1.337500e+01)
scalar::dy : (real:0:0:2,user:0:0:1,sys:0:0:0)[clk:220] (16 calls X 1.375000e+01)
scalar::dz : (real:0:0:2,user:0:0:1,sys:0:0:0)[clk:221] (16 calls X 1.381250e+01)
vector::max_abs_pos_help : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:94] (4 calls X 2.350000e+01)
vector::local_max_abs_pos_help : (real:0:0:0,user:0:0:0,sys:0:0:0)[clk:50] (4 calls X 1.250000e+01)
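To turn the trace into a breakdown, a small awk filter can express every timer's `[clk:N]` count as a percentage of a reference timer. This is a sketch, not a tool shipped with cubby: `clk_share` and `trace.txt` are hypothetical names, and it assumes the trace has one timer entry per line and that clk counts are proportional to wall-clock time (in this trace they match the `real:` fields at 100 ticks per second, e.g. 370607 for 1:01:46).

```bash
#!/bin/bash
# Sketch: print each timer's [clk:N] count as a percentage of a reference
# timer, for a trace with one timer entry per line (read on stdin or files).
# ASSUMPTION: clk counts are proportional to wall-clock time.
# Usage: clk_share 'loop' < trace.txt   (trace.txt is a hypothetical file)
clk_share() {
  local ref=${1:-main timer}
  shift
  awk -v ref="$ref" '
    match($0, /\[clk:[0-9]+\]/) {
      name = $0
      sub(/ : \(real:.*/, "", name)                  # keep the text before " : (real:"
      clk = substr($0, RSTART + 5, RLENGTH - 6) + 0  # digits between "[clk:" and "]"
      n++; names[n] = name; clks[n] = clk
      if (name == ref) total = clk
    }
    END {
      if (total == 0) { print "reference timer not found: " ref > "/dev/stderr"; exit 1 }
      for (i = 1; i <= n; i++)
        printf "%-40s %6.1f%%\n", names[i], 100 * clks[i] / total
    }' "$@"
}
```

With `loop` as the reference, this run gives 58.6% for "FFTW Transposition only" and 27.5% for "FFT only"; against `main timer` the same two entries come out at 54.5% and 25.5%.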
Results
energy_b
0.000000e+00 1.500000000000794e+00
1.000000e-02 1.470332161400734e+00
2.000000e-02 1.441308650819588e+00
3.000000e-02 1.412902100169330e+00
4.000000e-02 1.385088266721843e+00
5.000000e-02 1.357845578006378e+00
6.000000e-02 1.331154740654514e+00
7.000000e-02 1.304998405197896e+00
8.000000e-02 1.279360879032680e+00
9.000000e-02 1.254227880876765e+00
1.000000e-01 1.229586330981230e+00
energy_v
0.000000e+00 1.250000000000056e-01
1.000000e-02 1.177205428581816e-01
2.000000e-02 1.108648892214804e-01
3.000000e-02 1.044082833349543e-01
4.000000e-02 9.832744665994622e-02
5.000000e-02 9.260048207102142e-02
6.000000e-02 8.720678576620740e-02
7.000000e-02 8.212696607670689e-02
8.000000e-02 7.734276841312238e-02
9.000000e-02 7.283700569788115e-02
1.000000e-01 6.859349372779107e-02
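A quick way to check that both energies decay smoothly is to print the ratio of each sample to the previous one. This bash/awk sketch assumes each file holds one `time energy` pair per line, with the simulation time in the first column; `decay_ratios` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: ratio E(t_k) / E(t_{k-1}) between successive energy samples.
# ASSUMPTION: one "time energy" pair per line, time in column 1.
# Usage: decay_ratios energy_b      (or: ... | decay_ratios)
decay_ratios() {
  awk 'FNR == 1 { prev = $2 + 0; next }                 # first sample of each file
       prev != 0 { printf "%s %.4f\n", $1, $2 / prev }  # ratio to previous sample
       { prev = $2 + 0 }' "$@"
}
```

On the data of this run the ratios stay close to 0.980 for energy_b and 0.942 for energy_v, i.e. each output interval removes roughly 2% and 6% of the respective energy.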
Last modified on Oct 28, 2011 5:35:46 PM