    clareoconsulting Reply #1, 12 years ago
    Greetings,

    In case someone else finds this useful, I thought I'd share what I found / did after discovering a huge session table in the DB.

    I just created a new site for a client; one who had formerly used a (poor) discussion forum tool that used, well... oodles of unique URLs. I've eliminated that forum tool from the current site.

    The new site has been up for a week (on a new host), and the host had applied a small amount of "throttling" to the site due to high system resource utilization -- unexpected, since the site gets fewer than 100 visitors per day according to Google Analytics.

    I looked in the host's slow DB queries log and found a number of entries for simple queries against the modx_session table. Looking at the table, I found it had 60,000+ rows after only a week online!

    That made no sense at all -- at first -- with fewer than 100 visitors / day. Then my brain kicked in and I remembered the old forum software. I looked in the raw Apache access logs for the site and found that it's getting hit about six times per minute by search engines -- primarily the Chinese Baiduspider (or something claiming to be) -- trying to crawl the old, now non-existent forum software's URLs.

    Of course, MODX was handling each of those requests and displaying a 404, but I'm guessing that because the crawlers weren't accepting cookies(?), a new session was getting created for each request. Well... yuck.

    SOLUTION
    My solution was to create a folder for the old forum software and add an .htaccess file in that specific folder that lets Apache return an HTTP 404 for every URL from the old forum software, without MODX (and the DB) needing to get involved:

    # Disable mod_rewrite here so requests to this folder never get rewritten to MODX's index.php
    RewriteEngine Off

    # Serve a static page for requests to the directory itself and as the 404 response
    DirectoryIndex 404.html
    ErrorDocument 404 /obsolete-forum-directory/404.html


    I also added an entry in the robots.txt file to keep robots out of that directory, hopefully causing them to give up a bit sooner.
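
    For reference, the robots.txt entry is just a one-line disallow for that folder (assuming the same /obsolete-forum-directory/ path as in the snippet above):

    User-agent: *
    Disallow: /obsolete-forum-directory/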

    Anyway -- the modx_session table is now nice and quiet, and I suspect that the host's throttling algorithm will be happier too, as will the resultant site performance.

    Hope this helps someone else with the same unusual combination of circumstances.
    • Hello,

      We have also had some problems with the session table getting a little large (~40MB at peak), which truncating resolved for the time being. If I recall correctly, there was a bug (since fixed) that prevented garbage collection on that table. Sorry, I don't recall the specifics offhand.
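
      For anyone who needs to do the same cleanup, the truncation itself is a one-liner against the session table (modx_session here, assuming the default modx_ table prefix):

      TRUNCATE TABLE modx_session;

      Note that this logs out anyone with an active session, so it's best done during a quiet period.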

      One change you can make (if you don't need sessions, i.e. no login area) would be to disable them entirely using sessionless contexts (http://develop.modx.com/blog/2012/04/05/new-for-2.2.1-session-less-contexts).

      Hope that helps some.
        Patrick | Server Wrangler
        clareoconsulting Reply #3, 12 years ago
        Thanks for the input. I'd seen that blog post and think it's a very nice performance idea; I'll consider it in the future for other sites. This one is about to need a login area, so I'll leave sessions alone.

        It is worth noting, however, that on a shared server Revo handled 60,000 session creations in a week and did just fine. It's only because I watch the host's server load reporting carefully post-launch that I even noticed it.

        Trapping the ill-behaved Baiduspider crawler's attempts to reach the old forum URLs with a pure Apache solution -- leaving MODX's generation of 404s out of it -- has really settled down the host's sense of server load (and the size of the MODX session table), so I'm happy with how it's going now.

        BTW -- The garbage collection on that table is working. In one of the slow query logs, I saw the DELETE query for rows in that table over a week old.
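
        For anyone curious, that cleanup boils down to a query along these lines (column names per the default modx_session schema; the exact statement MODX issues may differ):

        DELETE FROM modx_session WHERE access < UNIX_TIMESTAMP() - 604800;

        Here 604800 seconds is one week, matching the week-old cutoff I saw in the log.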
          • Don't like digging up old posts, but this is related.

          I have a 16,000 row / 115MB modx_session table (oldest timestamp from 3 days ago) and I'm convinced the majority of entries have been created by visits from bots.

          This doesn't seem like a major problem, as requests to the table are fast, but I do have a lot of contexts, so the data field is quite hefty. It seems like a lot of overhead when bots don't benefit from having a session entry.

          Is there any way to prevent bots from triggering rows being written to the session table? I understand a mod to the session/request handler might be necessary, perhaps with a user-agent check to identify bots.
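
          As a rough illustration of that idea (generic PHP, not MODX's actual request handler -- the pattern list and where to hook it in are assumptions), the check would look something like this:

          <?php
          // Grab the user agent (may be absent for some clients).
          $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

          // Crude bot detection by user-agent substring ("spider" also catches Baiduspider);
          // extend the pattern as needed.
          $isBot = (bool) preg_match('/bot|crawl|spider|slurp/i', $userAgent);

          if (!$isBot) {
              // Only start a session for visitors that don't look like crawlers;
              // bots get no server-side session at all.
              session_start();
          }

          The trade-off is that a bot spoofing a browser user agent still gets a session, so this is a heuristic rather than a guarantee.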