We have been live on 10.0.7.4 since November 2016 - it's a buggy version but we have no maintenance so can't upgrade. System performance is typically pretty good - we've been through all the usual stuff with caching screens etc. to speed things up. We have 20 shop floor data collection users and 20 back office folks, with around 7-10 of them using the main Epicor system. Everyone is based at one site and directly connecting to Epicor (no terminal server).
Both app and SQL are virtualised on VMware - SQL has 24 GB RAM with 4 multi-core CPUs, and the app server has 24 GB RAM with 4 multi-core CPUs. The physical hosts that the VMs run on only run these VMs.
We have noticed over the last few weeks that at random periods, say once a day, the shop floor MES and some screens in the main application, particularly PO, become extremely sluggish when moving between fields or reading data. It lasts for a few minutes and then clears itself. I've been tracking performance on the servers both within Windows and vSphere, and when this happens the servers are barely doing anything, so it is nothing obvious like running out of CPU/RAM/disk read-writes. The Epicor Performance and Diagnostics Tool comes up clean as well.
There is no pattern - it can happen at any time of the day, including back shift/night shift when there are no office staff using the full client, and frequently when it is reported, I and others who are using other screens don't have an issue. I can replicate it on my machine/server on the affected screens when it is reported, and I do nothing to fix it (no IIS restart, no starting and stopping services) - whilst I am trying to establish what might be wrong it just fixes itself and performance is fine again for a few days.
I have checked all the obvious things - no Epicor tasks running, no backups, no queries/reports running at the time, anti-virus is not an issue, and I have also turned off any BPMs on the PO and labour entry screens as a process of elimination; the issue still happens at some random point. It happens on a simple job/simple PO. There is also no sign of any underlying network issue, as other systems/servers all seem to be working OK.
Regarding the shop floor data collection, I have made sure that there is nobody left clocked on or with a long-running labour booking.
I appreciate this is a wee bit like saying my car is running slow, no warning lights on, no steam coming from the bonnet, and then expecting someone to tell me exactly what is wrong, but has anyone seen something like this before and/or can you suggest alternative things I can try, bearing in mind we have no access to support?
Check blocking and deadlocks first in SQL. Also look at your wait stats to see what the SQL instance is waiting on and struggling with. That should give you a starting point. Also check your statistics for your indexes too. Chasing fragmentation will cause you headaches, but keep that in mind also.
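If it helps, the two DMV queries below are the sort of thing I'd run while the slowdown is actually happening. This is only a rough sketch - pyodbc and the server name are my assumptions, and the same T-SQL can be pasted straight into SSMS if you'd rather not script it:

```python
# Minimal sketch: poll SQL Server DMVs for blocked requests and top waits
# while the slowdown is in progress. Assumes pyodbc, a trusted connection,
# and VIEW SERVER STATE permission; the server name is a placeholder.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=YOUR-SQL-SERVER;DATABASE=master;Trusted_Connection=yes;"
)

BLOCKING_SQL = """
SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time,
       r.status, r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;   -- requests currently being blocked
"""

TOP_WAITS_SQL = """
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;         -- cumulative since last restart/clear
"""

with pyodbc.connect(CONN_STR) as conn:
    cur = conn.cursor()
    print("-- blocked requests --")
    for row in cur.execute(BLOCKING_SQL):
        print(row)
    print("-- top waits (cumulative) --")
    for row in cur.execute(TOP_WAITS_SQL):
        print(row)
```

If you see lots of blocked requests, or a wait type such as LCK_M_*, CXPACKET or PAGEIOLATCH_* dominating during the slow spell, that narrows down where to dig next.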
We were having significant memory issues with our SQL server usage back when we were on 10.1.600.5 that have completely gone away in our patch to 10.1.600.20.
We never could figure out why our test database with no users in it was running at like 80%-90% usage. Then our production was frequently getting up to 95% usage, people opening programs with large customizations were having difficulties, and we'd do a reboot of the server. We found we had to take away the other instances of our database used for development (i.e., test2, training) to reduce SQL Server usage. It was pretty irritating, as it's a completely new server installed back in August 2018. The (Epicor platinum partner) consultant who did the install and migration couldn't see anything wrong with it either.
Since the new patch it's completely gone.
I saw this fella's post and it made me think of our issue - perhaps the suggestions in the exchange might help you too.
@EpicorAnon has a couple good points. You should also check to make sure your database is not auto-growing during peak hours. Depending on the current size and the auto-grow size this can definitely cause slowdowns. The solution - right size the database so auto-grow is more of an emergency solution. I typically recommend having a minimum of 35% free space within the log and data files.
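If you want to sanity-check how close you are to an auto-grow, something along these lines will show the free space per data/log file. Again only a rough sketch - pyodbc and the server/database names are placeholders, and the query itself pastes straight into SSMS:

```python
# Rough free-space check per data/log file for the Epicor database, to see
# how much headroom you have before an auto-grow. Assumes pyodbc; the
# server/database names are placeholders.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=YOUR-SQL-SERVER;DATABASE=YourEpicorDB;Trusted_Connection=yes;"
)

FILE_SPACE_SQL = """
SELECT name,
       type_desc,
       size * 8 / 1024.0                                     AS size_mb,
       FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024.0          AS used_mb,
       (size - FILEPROPERTY(name, 'SpaceUsed')) * 8 / 1024.0 AS free_mb,
       growth,
       is_percent_growth
FROM sys.database_files;    -- size and growth are reported in 8 KB pages
"""

with pyodbc.connect(CONN_STR) as conn:
    for row in conn.cursor().execute(FILE_SPACE_SQL):
        pct_free = 100.0 * row.free_mb / row.size_mb if row.size_mb else 0
        print(f"{row.name:<20} {row.type_desc:<10} "
              f"{row.size_mb:8.0f} MB total, {row.free_mb:8.0f} MB free "
              f"({pct_free:.0f}% free)")
```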
Those auto-growths will get you too. We had a similar problem. It also may be environmental (network, storage, etc.) outside of Epicor. You can use TCPING to test the latency between your app and DB servers. Also, the Epicor performance monitoring tool could be useful here to see what activity is happening during those slow times.
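If you don't want to install TCPING, a quick-and-dirty equivalent in Python looks something like this - host and port are placeholders, so point it at whatever your client-to-app or app-to-DB path actually uses:

```python
# Poor man's TCPING: time repeated TCP connects from the client (or app
# server) to a target port to spot latency spikes or failed connects.
# The host/port below are placeholders.
import socket
import time

HOST, PORT = "YOUR-SQL-SERVER", 1433   # e.g. SQL 1433, or the app server port
SAMPLES, TIMEOUT_S = 20, 2.0

for i in range(SAMPLES):
    start = time.perf_counter()
    try:
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT_S):
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{i + 1:3d}: connected in {elapsed_ms:7.1f} ms")
    except OSError as exc:
        print(f"{i + 1:3d}: FAILED after {TIMEOUT_S}s ({exc})")
    time.sleep(1)   # one probe per second
```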
Which HW Version of VMWare are you on?
Which vSphere Version do you have installed?
What type of Host do you have UCS Cisco or Dell or HP?
NetApp Storage, SSDs?
We had extreme slowness across the board and for us there were several things that helped.
Make sure your firmware is up to date - with all the recent security flaws and massive updates, your vendors more than likely have new firmware out. Get on the latest firmware! (We had to update our Cisco UCS firmware.)
Are both the SQL and app guests on the same physical host? When the system is slow, have you tried logging in directly on the app server to see if the performance is the same? If you run the PDT performance test and config check, does it look within spec?
All good ideas - adding another one - we had network issues at a couple of customers where client-to-server connections were being reset constantly. We've added a retry counter to the client trace in future versions for that reason. It was driving us nuts until we tried adding the trace and voila - 5, 7, 19 retries to make a single server call for periods of time during the day. They went off and located the issue in their network.
PS - I know you don't want to hear it and I doubt you control the purse strings, but 10.1 and 10.2 have advanced a ton over 10.0 - especially in a lot of reliability areas. 10.0.700.4 shipped in May 2014, so it is starting to show its age. Getting back on maintenance when you can is definitely recommended, but I understand you need to keep the lights on at your business and probably have competing interests to fund. Good Luck!
I’m assuming this probably isn’t the case since you said this can happen during night when the office staff isn’t present, but I would look into your current count of Epicor licenses being used when the problem occurs. It might be possible that many people have more than one instance of Epicor open and are reaching your max capacity. We’ve had this situation happen in Epicor 9 and experienced slowed wait times for processes until the in-use license count dropped down.
Sometimes the symptom of “Random” slowness has nothing to do with where you see it… Case in point… customer reported: “Every day, sometime in the afternoon, Epicor gets really slow… sometimes it happens in the morning as well”… after some investigation, we found that it was not just Epicor… their email also got slow… turned out that at the very same time as the system got slow, someone in the company had kicked off a backup of all their engineering data across the network literally consuming all network bandwidth.
Update on this - the issue is still happening. I followed some of the suggestions on here - SQL Server and the databases seem to be in rude health, and the servers/VMware all seem to be in good shape. The database is set to auto-grow but has tonnes of space, so that does not appear to be happening.
The E10 app and DB are virtualised on the same physical host and use a virtual switch, so they are effectively directly connected. Backups run out of hours, and I have confirmed backups are not running at the times I can replicate the slowness.
What I have noticed is that if I access the Epicor client at the same time from the app server, there are no issues with performance. I brought up a Windows desktop VM on the same physical host and did the same test, and again the Epicor client/MES work without issue. As several folks mentioned, this would appear to be external to the Epicor environment. When I find out what it is I will let you know.
Are you running any logs to monitor perf at the app server or client? The newer client logging has measurements for server time, network transfer time, reconnects, etc.
When you say rude health, do you mean rudely good or bad? Sorry, the sarcasm is not coming through in text.
Another update, as I am pretty certain that this is a network communications issue rather than SQL Server etc. As mentioned before, we have built a Windows client VM on the same host as the Epicor app and DB servers, using the virtual switch. Whenever we see slowness we test on this, and without fail it is reading and writing without delay, whilst all other clients are slow.
We have another physical host running our other non-Epicor VMs. This is connected to the same physical switch as the other physical host. If we bring up a Windows client VM on this host, it has the same slowness as all of the other clients.
I have been running Wireshark on a client to analyse network traffic when we see the slowness, and thus far there is no obvious gotcha in amongst the screeds of logs that Wireshark generates. Should I be running this on the app server rather than a client, and if so, are there any particular protocols or packet types that I should focus on?
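For what it's worth, as a first pass I've been thinking of scripting something like the below over an exported capture, just to count suspected retransmissions and zero-window packets per conversation. Scapy, the crude same-sequence-number heuristic and the file name are all my own assumptions, so happy to be told there's a better way to slice it:

```python
# Rough first pass over a Wireshark capture: count suspected TCP
# retransmissions and zero-window packets per conversation. Uses scapy and a
# simple "same flow + same sequence number with payload" heuristic, so treat
# the numbers as a pointer, not proof; the pcap file name is a placeholder.
from collections import Counter, defaultdict
from scapy.all import rdpcap, IP, TCP

packets = rdpcap("slow_period.pcapng")   # export the slow window from Wireshark

seen = defaultdict(set)        # flow -> sequence numbers already seen
retrans = Counter()            # flow -> suspected retransmissions
zero_window = Counter()        # flow -> zero-window packets

for pkt in packets:
    if not (IP in pkt and TCP in pkt):
        continue
    tcp = pkt[TCP]
    flow = (pkt[IP].src, tcp.sport, pkt[IP].dst, tcp.dport)
    payload_len = len(bytes(tcp.payload))
    if payload_len and tcp.seq in seen[flow]:
        retrans[flow] += 1     # same sequence number sent again with data
    seen[flow].add(tcp.seq)
    if tcp.window == 0 and not tcp.flags.R:
        zero_window[flow] += 1

for flow, count in retrans.most_common(10):
    print(f"{flow}: {count} suspected retransmissions, "
          f"{zero_window[flow]} zero-window packets")
```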
As part of normal IT activity we replaced one of the users' PCs last week - they are one of the heaviest users of the main Epicor client. Since then we have had no reported slowness. The user on that PC reported an Epicor issue this morning - they weren't getting every row returned when searching; they were only getting 100. On their old PC, we had set DefaultSearchPageSize=10000 rather than the default of 100 in the Epicor config file.
Could this be the cause of the random slowness? Every time they searched they were potentially bringing back thousands of records. I'm not a SQL Server architect so I don't know the specific terms, but I assume there is some kind of caching of queries/data, with an associated helper/hygiene process that is triggered when that cache is full or the query/data is considered stale - could that explain the randomness of this, and also why it could happen out of normal hours?
In the meantime, the user has indicated that they need to see more than 100 records on some screens, so I have increased it to 1000 and will see if the system slows down again, and adjust accordingly.
I assume DefaultSearchPageSize is set to 100 for a reason - is there any guidance on the system impact of increasing this parameter? To confirm, we had only made this change for a single user.
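In case it's useful, this is the rough script I knocked up to check which client configs have a non-default value set. The share path and the pattern matching are guesses at our own layout, so adjust it to wherever your client .sysconfig files actually live:

```python
# Quick audit sketch: find client .sysconfig files with a non-default
# DefaultSearchPageSize. The share path is a placeholder and the regex is
# deliberately loose, since the exact element format may differ.
import re
from pathlib import Path

CONFIG_ROOT = Path(r"\\fileserver\EpicorClients")   # placeholder path
PATTERN = re.compile(r"DefaultSearchPageSize\D{0,12}(\d+)", re.IGNORECASE)

for cfg in CONFIG_ROOT.rglob("*.sysconfig"):
    text = cfg.read_text(encoding="utf-8", errors="ignore")
    match = PATTERN.search(text)
    if match and match.group(1) != "100":
        print(f"{cfg}: DefaultSearchPageSize={match.group(1)}")
```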
I feel like that wouldn’t be the issue, but I suppose it might be possible depending on how many records these searches need to go through. To check, where exactly are these searches taking place? Is this in areas like searching for job records in job entry or somewhere else entirely?
Job entry, sales order entry, purchase order entry and the trackers. In the case of purchase order entry, I think the search displays in descending order with the most recent order at the top, and the filter for open, approved etc. is not always used, so there will be regular occasions where every single purchase order on the system is being returned by a search.
In principle I am dubious that this would be the cause, but given it is the only thing that has changed, and the slowdown hasn't happened for the best part of a week, I can't discount it - which is the reason I am asking.