TIPS PIAF Green intermittently restarting asterisk

Dadburns

Guru
Joined
Feb 25, 2010
Messages
17
Reaction score
0
I have a PIAF Green (2.0.6.5 32-bit) implementation on a Rhino Ceros 3U with a Rhino R4T1 card connected to two PRI circuits (1 local, 1 LD).
Linux Distribution: (Redhat CentOS release 6.5 (Final))
FreePBX version: (2.10.1)
Asterisk 11.10.0
This is a production system that I brought online on the 1st in "emergency mode" when the previous hardware failed, leaving 60+ extensions dead. The previous system was running an old version of PIAF dating to 2011 (1.7.5.6.2). To recover I loaded a (admittedly old) backup onto a virtual PIAF 1.7.5.6.2 and then exported bulk DIDs and bulk Extension csv files to move onto the PIAF Green. It all seemed to go pretty smoothly, with some tweaking to get the call flow right, but now that it's in production Asterisk restarts intermittently, dropping all the calls and really annoying these folks.
Sometimes it will restart 3-4 times in ten minutes; sometimes it stays up for a couple of hours (it's at 1:37 of uptime as I type). When I check the logs I can find the point where it restarts, but I can't see anything consistent right before it happens. It looks like regular activity, then it restarts, as in:
-----------------------------------------------
[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] pbx.c: -- Executing [s@macro-hangupcall:1] GotoIf("DAHDI/i1/3184829205-b7", "1?theend") in new stack
[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] pbx.c: -- Goto (macro-hangupcall,s,3)
[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] pbx.c: -- Executing [s@macro-hangupcall:3] ExecIf("DAHDI/i1/3184829205-b7", "0?Set(CDR(recordingfile)=)") in new stack
[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] pbx.c: -- Executing [s@macro-hangupcall:4] Hangup("DAHDI/i1/3184829205-b7", "") in new stack
[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] app_macro.c: == Spawn extension (macro-hangupcall, s, 4) exited non-zero on 'DAHDI/i1/3184829205-b7' in macro 'hangupcall'
[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] pbx.c: == Spawn extension (ext-queues, h, 1) exited non-zero on 'DAHDI/i1/3184829205-b7'
[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] chan_dahdi.c: -- Hungup 'DAHDI/i1/3184829205-b7'
[2014-12-04 12:52:31] Asterisk 11.10.0 built by root @ pbx.local on a i686 running Linux on 2014-10-23 14:46:45 UTC
[2014-12-04 12:52:31] VERBOSE[29744] config.c: == Parsing '/etc/asterisk/asterisk.conf': Found
[2014-12-04 12:52:31] VERBOSE[29744] manager.c: == Manager registered action DBGet
[2014-12-04 12:52:31] VERBOSE[29744] manager.c: == Manager registered action DBPut
[2014-12-04 12:52:31] VERBOSE[29744] manager.c: == Manager registered action DBDel
[2014-12-04 12:52:31] VERBOSE[29744] manager.c: == Manager registered action DBDelTree
------------------------------------------
Anyone have any ideas on this? At this point they are burning me in effigy and I fear it won't be long till they want to do it for real.
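
For anyone hunting for the same pattern in a long log, here is a minimal sketch of one way to pull out every restart point and the few lines logged just before it. It assumes the startup banner always looks like the "Asterisk 11.10.0 built by ..." line in the excerpt above; the function names and the `context` parameter are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp prefix and startup banner, as they appear in the excerpt above.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")
BANNER = re.compile(r"^\[[^\]]+\] Asterisk \S+ built by ")


def parse_stamp(line):
    """Return the leading [YYYY-MM-DD HH:MM:SS] stamp as a datetime, or None."""
    m = STAMP.match(line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S") if m else None


def find_restarts(lines, context=3):
    """Return one dict per startup banner, with the lines logged just before it."""
    restarts = []
    recent = []  # rolling window of the last `context` lines seen
    for raw in lines:
        line = raw.rstrip("\n")
        if BANNER.match(line):
            started = parse_stamp(line)
            last = parse_stamp(recent[-1]) if recent else None
            gap = (started - last).total_seconds() if started and last else None
            restarts.append(
                {"started": started, "gap_seconds": gap, "before": list(recent)}
            )
        recent = (recent + [line])[-context:]
    return restarts


# Lines copied from the excerpt above.
SAMPLE = [
    "[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] pbx.c: == Spawn extension (ext-queues, h, 1) exited non-zero on 'DAHDI/i1/3184829205-b7'",
    "[2014-12-04 12:52:26] VERBOSE[28655][C-000000ce] chan_dahdi.c: -- Hungup 'DAHDI/i1/3184829205-b7'",
    "[2014-12-04 12:52:31] Asterisk 11.10.0 built by root @ pbx.local on a i686 running Linux on 2014-10-23 14:46:45 UTC",
    "[2014-12-04 12:52:31] VERBOSE[29744] config.c: == Parsing '/etc/asterisk/asterisk.conf': Found",
]

for restart in find_restarts(SAMPLE):
    print(restart["started"], "gap:", restart["gap_seconds"], "s")
    for line in restart["before"]:
        print("   ", line)
```

To run it against a real log, pass an open file handle to `find_restarts` instead of `SAMPLE`. Comparing the "before" lines across several restarts is a quick way to confirm whether anything consistent precedes them.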
 

john p

Member
Joined
Jul 9, 2013
Messages
82
Reaction score
6
I have two ideas to try. First, how long has the server been up, and does top show any RAM/CPU usage of concern? At the least I'd reboot the box. Second, have you increased the verbosity to at least 7 and kept a terminal session open so you can see what the system is reporting in real time? In other words, could the log be missing output that you'd see on screen because it was lost in the crash? If practical, I'd also consider replacing the RAM, as I've seen a stick going bad produce such weird effects. Good luck!
 

Dadburns

Guru
Joined
Feb 25, 2010
Messages
17
Reaction score
0
Hmm... Thanks for that John P. The server has been up since the 1st, but we have rebooted several times including a power cycle. The memory and CPU usage show a reasonable balance with the CPU loafing at an average of 0.27 and the memory hovering in the 30-55% area depending on call load. This evening I have been trying to troubleshoot further, including inducing a call loop through another server so I could pile the load on and did indeed find that I could pressure the system into failure. I am driving two hours to get back onsite in the morning with another server; a slightly newer one with a faster CPU and double the RAM plus a fresh install of PIAF Green. Your thoughts on upping the verbosity are good, I'm going to try that too. I'm also going to wear my asbestos underwear... what with the whole burning in effigy thing going on.

Again, thanks for your input, I will post follow-up tomorrow if they let me live.
 

atsak

Guru
Joined
Sep 7, 2009
Messages
2,385
Reaction score
439
I've seen seg faults etc on PRI cards when the DAHDI / Kernel / Driver chemistry isn't right. . . . maybe check with Rhino on compatibility of versions?
 

Dadburns

Guru
Joined
Feb 25, 2010
Messages
17
Reaction score
0
Well, I have returned from the wars... I replaced the server with the original, more powerful unit (failed HD replaced) with a fresh install of PIAF Green, moved a backup from the faulty system to the new one and off to the races we went.... and the exact problem continues. The only item carried over from one to the next was the Rhino R4T1 card (which was working fine up until the HD died on the original system). I have checked and am running the latest driver from Rhino, which they say is good with Asterisk 11.10.
Today I have been digging into the core dumps found in /tmp and am seeing these two results on different failures with the "signal 6" one more common:
---------------------------------------
Core was generated by `/usr/sbin/asterisk -f -U asterisk -G asterisk -vvvg -c'.
Program terminated with signal 6, Aborted.
#0 0x00c72424 in __kernel_vsyscall ()
"/tmp/core.asterisk.com-2014-12-05T15:46:04-0600" is a core file.
----------------------------------------
Core was generated by `/usr/sbin/asterisk -f -U asterisk -G asterisk -vvvg -c'.
Program terminated with signal 11, Segmentation fault.
#0 0x005b5cc3 in ?? ()
"/tmp/core.asterisk.com-2014-12-05T16:26:11-0600" is a core file.
----------------------------------------
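
With a pile of core files in /tmp, it can help to tally which signal shows up most often before digging into any single dump. Below is a minimal sketch that counts the "Program terminated with signal N" lines from gdb summaries like the two above. The function name and the small signal table are made up for illustration; the table only covers the two signals seen here, using their usual Linux numbering.

```python
import re
from collections import Counter

# Only the two signals from the summaries above (usual Linux numbering).
SIGNAL_NAMES = {6: "SIGABRT", 11: "SIGSEGV"}

SIGNAL_LINE = re.compile(r"Program terminated with signal (\d+)")


def tally_signals(gdb_text):
    """Count how often each terminating signal appears in pasted gdb summaries."""
    counts = Counter()
    for number in SIGNAL_LINE.findall(gdb_text):
        counts[SIGNAL_NAMES.get(int(number), "signal " + number)] += 1
    return counts


# The two summaries quoted above.
SAMPLE = """\
Program terminated with signal 6, Aborted.
#0 0x00c72424 in __kernel_vsyscall ()
Program terminated with signal 11, Segmentation fault.
#0 0x005b5cc3 in ?? ()
"""

for name, count in tally_signals(SAMPLE).most_common():
    print(name, count)
```

Signal 6 means the process aborted itself, while signal 11 is an invalid memory access. Neither summary has usable symbols in frame #0, so a full backtrace from a build with debug symbols would probably say more than the tally does.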

I'm still not sure where to go on this. I'm considering removing the R4T1 card, moving the trunks to another server on the network, and routing calls to/from this server via IAX2. If that resolves the problem then I can look at the card; if it doesn't, I will consider moving them back to the PIAF version that was working on this hardware.

Other things I have tried:
We noted that the restarts seemed to follow shortly after we made changes via FreePBX, so we made no changes for a while, and it did it again anyway.
We noted that just before the restarts the webserver status indicator would go yellow. Given that and the possibility of FreePBX being involved, I ran the system with httpd off, just Asterisk by itself, and again the restarts occurred.
 

TwigsUSAN

Guru
Joined
Apr 7, 2011
Messages
215
Reaction score
24
Could it be from importing the 1.7.X.X version into Asterisk 11? I thought you couldn't do that.
 

hecatae

resident hecatae
Joined
Feb 7, 2014
Messages
765
Reaction score
200
TwigsUSAN said: "Could it be importing the 1.7.X.X version into Asterisk 11. I thought you couldn't do that."

It looks like a 1.7 backup was not imported, just stripped of the relevant info:

Dadburns said: "To recover I loaded a (admittedly old) backup onto a virtual PIAF 1.7.5.6.2 and then exported bulk DIDs and bulk Extension csv files to move onto the PIAF Green."

Would there be an issue installing piaf3 from the piaf3 install script? http://pbxinaflash.com/community/index.php?resources/piaf3-install.38/
 

Dadburns

Guru
Joined
Feb 25, 2010
Messages
17
Reaction score
0
Many thanks for everyone's input.
TwigsUSAN, I didn't load the 1.7 backup onto Asterisk 11. I put the backup onto a virtual instance of PIAF 1.7 and used it to output .csv files and as a visual guide for my setup on Asterisk 11, but thanks for the thought.
hecatae, I used the default ISO to create an install disk for PIAF 2.0.6.5, so maybe that install script you referenced could be a help. I haven't had a chance to try it.

Really my problem was that this is a very active production system in a (relatively) large Asterisk network and that created a lot of pressure to fix it immediately without a lot of opportunity to research/test. This particular server was providing voice services to 60+ users in the Administration department for a large resort company based in Branson Mo. and of course the admin people included the CEO, CFO, etc. etc.... namely the folks with the least tolerance for intermittent problems.

On the other hand, this customer having a network of servers created an opportunity to get them back online; there was another server in the same rack that had two PRI ports available so over the weekend I was able to move the PRIs, copy all extensions, routes, IVRs..... basically recreate the problematic system on this other server running the older PIAF implementation and get the users back online. I will recover the problem server from Branson and get it back to my lab where I can do some further testing without the pressure of user needs holding me back. When and if I figure out what the problem is I will post it here.

Again, thanks to everyone who popped in with suggestions.
 

Dadburns

Guru
Joined
Feb 25, 2010
Messages
17
Reaction score
0
Well, I have found the culprit and the work-around for this issue and frankly it's (as usual) my own fault and the fix falls into the category I call "The Dumb-A$$ Switch". That's when you spend way too much time on a problem and eventually say "Hey, what's this switch here?!? Flip it and the problem is solved." In my lab when you're struggling with a problem, a common question is "Did you look for the D.A. switch?"

The cause: This customer's network has six PIAF servers and four Xorcom servers (running their variant of Elastix) servicing about 1100 endpoints. These are all connected via IAX2 trunks over their WAN. Several of these servers don't have any local PSTN connections and use the PRIs on the two servers I have been referencing in this thread via their IAX2 connections. There is a lot of traffic over these IAX2 trunks, and due to the sins of "dial-plan float" (in which users moving from one location to another have been allowed to keep their extension numbers), there is some pretty complex routing between these servers. A common outbound route on most of them is basically: if a 4-digit number (an extension number) is dialed that is unknown to the hosting server, send it to the primary switching server, which may then send it to the secondary switching server (the switching servers being the two boxes with the PRIs in).
Again, due to the sins of dial-plan float, those two switching servers may send that call right back to the server that originated it, creating a loop on that IAX2 trunk. If people would just stay put in this company it would be so much easier! Anyway, the pre-existing PIAF servers running Asterisk 1.4 and FreePBX 2.6 were OK with this; the call count would jump up really high until whoever placed the call hung up. The new PIAF server I put in as one of the two primary switching servers (Asterisk 11.10, FreePBX 2.10) would act the same, but depending on how high the call count went, Asterisk would crash (sooner if the call count was higher, later if it was lower, but it seemed to need to go past 200 or so to happen). And that was what caused it to be intermittent: people didn't dial those loop-back extensions very regularly, and some got impatient and hung up before the loop count got too high. I was able to duplicate this in my lab with both Asterisk 11.10 and 1.8 (I haven't tried 1.6); 1.4 doesn't respond that way.
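
To make the loop concrete, here is a minimal sketch that follows a dialed extension from server to server and reports when the call comes back to a box it has already visited. Everything in it is a simplified assumption: the server names, extension numbers and table layout are made up for illustration and are not the customer's real dial plan.

```python
def trace_call(servers, start, extension, max_hops=20):
    """Follow a dialed extension between servers.

    Returns (path, outcome), where outcome is one of
    'answered', 'loop', 'no route' or 'too many hops'.
    """
    path = [start]
    seen = {start}
    current = start
    while len(path) <= max_hops:
        server = servers[current]
        if extension in server["local"]:
            return path, "answered"
        # An explicit route for this extension wins; otherwise use the default.
        next_hop = server["routes"].get(extension, server["default"])
        if next_hop is None:
            return path, "no route"
        path.append(next_hop)
        if next_hop in seen:
            return path, "loop"
        seen.add(next_hop)
        current = next_hop
    return path, "too many hops"


# Hypothetical three-server network. Extension 4002 used to live on "branch",
# so "secondary" still has a stale route pointing back there.
SERVERS = {
    "branch": {"local": {"4001"}, "routes": {}, "default": "primary"},
    "primary": {"local": {"5001"}, "routes": {}, "default": "secondary"},
    "secondary": {"local": {"6001"}, "routes": {"4002": "branch"}, "default": None},
}

for ext in ("5001", "4002", "9999"):
    path, outcome = trace_call(SERVERS, "branch", ext)
    print(ext, " -> ".join(path), outcome)
```

Running something like this over exported route tables for every known extension would flag the loop-back numbers before a user finds them by dialing.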

The fix (the D.A. switch): Cap the number of calls per IAX2 trunk in FreePBX. Gosh, there's this handy field for it right there in the trunk config screen! Yes, I know I should have been doing that all along, but as I said earlier in this post there is a lot of traffic over these trunks, so I had left them wide open instead of estimating their usage and capping them. All done now, problem solved (after too many hours spent on it).
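
Why the cap works can be shown with a toy model of a single looping call. This is only a sketch under loud assumptions: one new call leg is added per trip around the loop, the crash threshold of 200 is just the rough figure mentioned above rather than a measured Asterisk limit, and the function and parameter names are made up for illustration.

```python
def run_loop(trunk_cap=None, crash_threshold=200, caller_patience=1000):
    """Stack up call legs on one looping trunk until something stops the loop.

    trunk_cap       -- maximum simultaneous calls allowed on the trunk (None = wide open)
    crash_threshold -- assumed leg count past which the server falls over
    caller_patience -- leg count at which the caller gives up and hangs up
    Returns (legs, reason).
    """
    legs = 0
    while True:
        if trunk_cap is not None and legs >= trunk_cap:
            # The trunk refuses the next leg, so the loop ends cleanly.
            return legs, "rejected by trunk cap"
        legs += 1
        if legs > crash_threshold:
            return legs, "crash"
        if legs >= caller_patience:
            return legs, "caller hung up"


print(run_loop())                    # trunk left wide open
print(run_loop(trunk_cap=30))        # trunk capped well below the threshold
print(run_loop(caller_patience=50))  # impatient caller, no cap
```

In this model the cap only helps if it sits below the level at which the server gets into trouble, so it is worth estimating real peak usage per trunk and setting the cap a bit above that.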

...On an unrelated note: can anyone tell me where in my forum user profile to modify the title (i.e. Guru or Member)? I think I need to change mine to D.A.

And one more time: Thanks for everyone's input on this. The help I have received in these forums has made all the difference many times.
 

hecatae

resident hecatae
Joined
Feb 7, 2014
Messages
765
Reaction score
200
Ouch, that is painful.

I managed the same trick 4 months back, looping a call 330 times a second, 180,000 calls in total, before I maxed out the server bandwidth. Yes, Asterisk did not crash the server; the bandwidth went first.
 
