Question :
I have a sharded database with two replica sets (RS1 and RS2), each with two servers. Yesterday the mongod instance on one member of RS2 crashed with an error. I tried to recover that member by resyncing it from the other member of the replica set, which took a long time, and when the sync finished I got the same error again:
Tue May 7 12:37:57.023 [rsSync] Fatal Assertion 16233
0xdcf361 0xd8f0d3 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f57205f2851 0x7f571f99811d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
/usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
/usr/bin/mongod() [0xe17cb9]
/lib64/libpthread.so.0(+0x7851) [0x7f57205f2851]
/lib64/libc.so.6(clone+0x6d) [0x7f571f99811d]
Tue May 7 12:37:57.155 [rsSync]
***aborting after fassert() failure
Tue May 7 12:37:57.155 Got signal: 6 (Aborted).
Tue May 7 12:37:57.159 Backtrace:
0xdcf361 0x6cf729 0x7f571f8e2920 0x7f571f8e28a5 0x7f571f8e4085 0xd8f10e 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f57205f2851 0x7f571f99811d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cf729]
/lib64/libc.so.6(+0x32920) [0x7f571f8e2920]
/lib64/libc.so.6(gsignal+0x35) [0x7f571f8e28a5]
/lib64/libc.so.6(abort+0x175) [0x7f571f8e4085]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd8f10e]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
/usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
/usr/bin/mongod() [0xe17cb9]
/lib64/libpthread.so.0(+0x7851) [0x7f57205f2851]
/lib64/libc.so.6(clone+0x6d) [0x7f571f99811d]
Any idea why this might be happening? How can I get this server to sync and work again? My last surviving server is now running as a secondary; is there a way to make it primary for a while so I can get the data out of it?
Thanks in advance!
Answer :
Your surviving member is running as a secondary because, on its own, it cannot form a majority of the set and therefore cannot be elected primary. This is why a replica set should always have an odd number of voting members (one of them can be an arbiter). I recommend adding a third node as soon as things are back to normal.
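To make the majority rule concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration and are not part of any MongoDB API; the only fact relied on is that a primary needs votes from a strict majority of the voting members.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that is a strict majority of the set."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable: int) -> bool:
    """True if the members that can still see each other form a majority."""
    return reachable >= majority(voting_members)


def tolerated_failures(voting_members: int) -> int:
    """How many voting members can be lost while a primary can still be elected."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    for n in (2, 3):
        print(n, "members: majority =", majority(n),
              ", tolerated failures =", tolerated_failures(n))
```

With two members the majority is two, so losing either one leaves the survivor stuck as a secondary. With three voting members (the third can be an arbiter) the set survives the loss of one.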
In terms of how to get the second node up and running, do the following:
- Shut down the remaining secondary (not strictly necessary, since it is read-only as a secondary, but safer)
- Copy its whole data directory (everything in dbpath) over to the failed host, replacing the failed member's data
- Restart both nodes
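The copy step above could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive procedure: it assumes both data directories are reachable as local paths (for example, after staging the files or mounting the other host's disk), the function name and the `.bad` backup suffix are invented here, and the lock-file check assumes that a cleanly stopped mongod leaves an empty `mongod.lock`. Across hosts you would more likely use a tool such as rsync or scp.

```python
import shutil
from pathlib import Path


def seed_data_directory(source_dbpath, target_dbpath):
    """Copy a stopped member's dbpath over a failed member's dbpath.

    Hypothetical helper: paths and naming are assumptions for illustration.
    """
    source = Path(source_dbpath)
    target = Path(target_dbpath)

    # Assumption: a non-empty mongod.lock means mongod is still running
    # (or did not shut down cleanly), so the files may be inconsistent.
    lock = source / "mongod.lock"
    if lock.exists() and lock.stat().st_size > 0:
        raise RuntimeError("source mongod does not appear to be cleanly shut down")

    # Keep the failed member's old files aside instead of deleting them.
    if target.exists():
        backup = target.with_name(target.name + ".bad")
        if backup.exists():
            raise RuntimeError(f"backup directory already exists: {backup}")
        target.rename(backup)

    # Copy everything in dbpath, including the local database files.
    shutil.copytree(source, target)
    return target
```

Keeping the old directory under a different name, rather than deleting it, leaves a way back if the copy turns out to be incomplete.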
One of the advantages of replica sets (over classic master/slave) is that their members are intended to be functionally identical to each other, so you can simply use the data from one “good” node to seed any other “bad” node.