#464 closed defect (fixed)
drakma / cl+ssl failure
Reported by: | charmon | Owned by: | Mark Evenson |
---|---|---|---|
Priority: | blocker | Milestone: | 1.7.0 |
Component: | libraries | Version: | 1.6.0-dev |
Keywords: | cl+ssl, ssl, drakma | Cc: | |
Parent Tickets: |
Description (last modified by )
c.f. <https://github.com/armedbear/abcl/issues/127>
This is probably a cl+ssl/cffi issue rather than a drakma issue, but when I try to do, e.g.,
(drakma:http-request "https://common-lisp.net/")
I get the following error:
A failure in the SSL library occurred on handle #<com.sun.jna.Pointer native@0x7fa96089a760 {41501C8A}> (return code: 1). SSL error queue: error:20080078:BIO routines:bio_write_intern:uninitialized [Condition of type CL+SSL::SSL-ERROR-SSL] Restarts: 0: [RETRY] Retry SLIME REPL evaluation request. 1: [*ABORT] Return to SLIME's top level. 2: [ABORT] Abort thread. Backtrace: 0: (#<FUNCTION {63F35033}> #<CL+SSL::SSL-ERROR-SSL {5769AA74}> #<FUNCTION {63F35033}>) 1: (APPLY #<FUNCTION {63F35033}> (#<CL+SSL::SSL-ERROR-SSL {5769AA74}> #<FUNCTION {63F35033}>)) 2: (SYSTEM::RUN-HOOK SYSTEM::*INVOKE-DEBUGGER-HOOK* #<CL+SSL::SSL-ERROR-SSL {5769AA74}> #<FUNCTION {63F35033}>) 3: (INVOKE-DEBUGGER #<CL+SSL::SSL-ERROR-SSL {5769AA74}>) 4: (ERROR CL+SSL::SSL-ERROR-SSL :HANDLE #<com.sun.jna.Pointer native@0x7fa96089a760 {41501C8A}> :RET 1 ...) 5: (CL+SSL:MAKE-SSL-CLIENT-STREAM #<TWO-WAY-STREAM {50E6169D}> :VERIFY NIL :HOSTNAME "common-lisp.net" ...) 6: (#<FUNCTION {4A462CFF}>) 7: (CL+SSL::CALL-WITH-GLOBAL-CONTEXT #<com.sun.jna.Pointer native@0x7fa9609829d0 {BB3CD71}> NIL #<FUNCTION {4A462CFF}>) 8: (DRAKMA::MAKE-SSL-STREAM #<TWO-WAY-STREAM {50E6169D}> :HOSTNAME "common-lisp.net" :CERTIFICATE NIL ...) 9: (DRAKMA:HTTP-REQUEST "https://common-lisp.net/") 10: (SYSTEM::%EVAL (DRAKMA:HTTP-REQUEST "https://common-lisp.net/")) 11: (EVAL (DRAKMA:HTTP-REQUEST "https://common-lisp.net/"))
Attachments (3)
Change History (36)
comment:1 Changed 6 years ago by
Component: | compiler → libraries |
---|---|
Milestone: | → 1.6.0 |
comment:2 Changed 6 years ago by
Update to latest stable jna-5.3.1 locally, then testing OpenJDK8 shows failure to be platform dependent.
- FreeBSD -- segmentation violation; JVM crashes; no Lisp trace
- macOS -- Null pointer exception with the following stacktrace
8: (JCALL-RAW #<java.lang.reflect.Method public java.lang.Object com.sun..... {6E71BAF0}> NIL #<java.lang.Class class com.sun.jna.Pointer {66A9D1AB}> #<jarray [Ljava.lang.Object;@2ffedc49 {7D7AEA62}>) Locals: EXTENSIONS::|method-ref| = #<method public java.lang.Object com.sun.jna.Function.invoke(java.lang.Class,java.lang.Object[])> EXTENSIONS::|instance|#1 = NIL EXTENSIONS::|&rest|#2 = #<java class com.sun.jna.Pointer> EXTENSIONS::|args|#3 = #<jarray [Ljava.lang.Object;@2ffedc49 {7D7AEA62}> 9: (CL+SSL:ENSURE-INITIALIZED) 10: (CL+SSL:MAKE-CONTEXT :VERIFY-DEPTH 10 :VERIFY-MODE 0 :VERIFY-CALLBACK ...)
comment:3 Changed 6 years ago by
Owner: | set to Mark Evenson |
---|---|
Status: | new → assigned |
comment:4 Changed 6 years ago by
Somewhere along the line, abcl-1.6.0-dev has picked up 7 unexpected failures in the CFFI tests:
7 unexpected failures: FUNCALL.VARARGS.NOSTDLIB, FUNCALL.VARARGS.CHAR, FUNCALL.VARARGS.INT, FUNCALL.VARARGS.LONG, FUNCALL.VARARGS.DOUBLE, FUNCALL.VARARGS.STRING, TEST-ASDF-LOAD
comment:5 Changed 5 years ago by
FWIW, going back in time to ABCL
fd87311d * @ Update to jna-4.2.2
and CFFI
600f440 * @ Revert "Mutex on hash table for cffi-libffi"
and running (asdf:test-system 'cffi) yields:
3 out of 308 total tests failed: FSBV.WFO, FSBV.MAKEPAIR.1, FSBV.MAKEPAIR.2.No unexpected failures.
while running that version of ABCL with current cffi HEAD yields:
10 out of 325 total tests failed: FUNCALL.VARARGS.NOSTDLIB, FUNCALL.VARARGS.CHAR, FUNCALL.VARARGS.INT, FUNCALL.VARARGS.LONG, FUNCALL.VARARGS.DOUBLE, FUNCALL.VARARGS.STRING, FSBV.WFO, FSBV.MAKEPAIR.1, FSBV.MAKEPAIR.2, TEST-ASDF-LOAD. 7 unexpected failures: FUNCALL.VARARGS.NOSTDLIB, FUNCALL.VARARGS.CHAR, FUNCALL.VARARGS.INT, FUNCALL.VARARGS.LONG, FUNCALL.VARARGS.DOUBLE, FUNCALL.VARARGS.STRING, TEST-ASDF-LOAD.
So I think some things got broken along the way in CFFI. I can bisect.
comment:6 Changed 5 years ago by
Looks like this is the culprit:
Head: 0d2b3c0 tests: funcall: call sprintf with foreign-funcall-varargs Tags: v0.19.0 (4), v0.20.0 (25) 0d2b3c0067529a9862c4c967158b75b040d1d8f8 is the first bad commit commit 0d2b3c0067529a9862c4c967158b75b040d1d8f8 Author: Daniel Kochmanski <daniel@turtleware.eu> Date: Sun Oct 1 14:43:17 2017 +0200 tests: funcall: call sprintf with foreign-funcall-varargs This function takes variable number of arguments and we have abstraction for that. :040000 040000 bbdb0b48f2d5d8fc9fde1a01b8955f88e83066d2 a5825c1e6228c3f9a7a0e41ebbe873c5c1622fdb M tests Bisect Rest (1) 0d2b3c0 * bad @ tests: funcall: call sprintf with foreign-funcall-varargs
comment:7 Changed 5 years ago by
Ah, but of course that's just exposing the broken behavior from the tests, so who knows when the actual breakage came in.
comment:8 Changed 5 years ago by
Ok, so I managed to get things to build with current ABCL, cffi, and drakma, and mostly-current cl+ssl. The problem is that pointing cl+ssl at libssl.so.1.1 breaks things:
commit fde3b207cb61f5e23a40f75a733c4a2b31a69dd1 Author: Anton Vodonosov <avodonosov@yandex.ru> Date: Mon Feb 4 01:20:37 2019 +0000 Actually use Openssl-1.1, if available.
Is the offending commit. If we go back to openssl-1.0, then things work again. So there are probably multiple problems. Why does openssl-1.1 break things? Why does this trigger the various NPEs/segfaults seen above? etc...
comment:9 Changed 5 years ago by
#+begin_example
CL+SSL> (default-ssl-method)
TLS-METHOD
#+end_example
#+begin_src lisp
(setf *ssl-global-method* (funcall (or method (default-ssl-method))))
(setf *ssl-global-context* (ssl-ctx-new *ssl-global-method*))
#+end_src
* Run the CL+SSL test suites
Invocation via FIVEAM:
#+begin_example
(fiveam:run-all-tests)
#+end_example
Add invocation to ASDF:
(swank:operate-on-system-for-emacs "cl+ssl.test" 'test-op) Running test suite CL+SSL Running test suite CL+SSL.SNI Running test suite CL+SSL.VERIFY-HOSTNAME Running test VERIFY-GOOGLE-CERT-PRINTABLE-STRING . Running test VERIFY-HOSTNAME-FAIL ......... Running test VERIFY-GOOGLE-CERT-DNS-WILDCARD . Running test VERIY-HOSTNAME-SUCCESS .............. Running test VERIFY-GOOGLE-CERT-TELETEX-STRING . Running test VERIFY-GOOGLE-CERT-UNIVERSAL-STRING . Running test VERIFY-GOOGLE-CERT-BMP-STRING . Running test VERIFY-GOOGLE-CERT . Running test VERIFY-GOOGLE-CERT-WITHOUT-DNS . Running test SANITY-CHECK.2 . Running test SANITY-CHECK.1 . Running test COMPAT-OPENSSL-VERSION . Running test suite CL+SSL.BADSSL-COM Running test WRONG.HOST ..Xf Running test EXPIRED ..Xf Running test UNTRUSTED-ROOT . Running test SELF-SIGNED . Did 41 checks. Pass: 39 (95%) Skip: 0 ( 0%) Fail: 2 ( 4%) Failure Details: -------------------------------- EXPIRED []: Unexpected Error: #<JAVA:JAVA-EXCEPTION java.lang.NullPointerException {59A01E02}> Java exception 'java.lang.NullPointerException'... -------------------------------- -------------------------------- WRONG.HOST []: Unexpected Error: #<JAVA:JAVA-EXCEPTION java.lang.NullPointerException {2F1F3092}> Java exception 'java.lang.NullPointerException'... -------------------------------- (:COMPILATION-RESULT NIL T 4.431 NIL NIL) CL-USER>
comment:10 Changed 5 years ago by
Here's a minimal varargs test that fails (from the CFFI test suite):
(asdf:test-system "cffi") (CFFI::%FOREIGN-FUNCALL-VARARGS "sum_double_arbitrary" (:INT 1) (:DOUBLE 3.14d0 :DOUBLE))
comment:11 Changed 5 years ago by
Ok, I'm down a rathole of multiple failures. I've fixed the vararg problem (more on that later). But it looks like the cl+ssl version checking stuff is completely bogus. Somehow, we open:
/usr/local/Cellar/openssl/1.0.2t/lib/libssl.1.0.0.dylib
but then the version checking stuff thinks that we have openssl 1.1 installed and tries to call openssl 1.1 functions and fails.
comment:12 Changed 5 years ago by
So both SBCL and ABCL use the following (on my system):
/usr/local/Cellar/openssl/1.0.2t/lib/libssl.1.0.0.dylib
But on SBCL, I get:
CL-USER> (format nil "~X" (cl+ssl::ssl-eay)) "1000214F"
and on ABCL, I get:
CL-USER> (format nil "~X" (cl+ssl::ssl-eay)) "20000000"
I think this is somehow making the cl+ssl stuff think that we're using post version 1.0 and it tries to call a function that doesn't exist.
comment:13 follow-up: 15 Changed 5 years ago by
Success! After a fashion anyway... Need to do:
@@ -125,7 +125,7 @@ (defmacro define-crypto-function-ex ((&key since vanished) name-and-options &bod *cl+ssl-crypto-foreign-function-names* :test 'equal) (defcfun-versioned (:since ,since :vanished ,vanished) - ,(append name-and-options #+(and lispworks darwin) '(:library libcrypto)) + ,(append name-and-options #+(and (or abcl lispworks) darwin) '(:library libcrypto)) ,@body))) (defmacro define-crypto-function (name-and-options &body body)
Will post a PR to cl+ssl.
comment:14 Changed 5 years ago by
Oh, and I fixed the ABCL varargs thing while I was at it, although this was probably a red herring.
comment:16 follow-up: 18 Changed 5 years ago by
I fixed the ABCL varargs thing while I was at it
<https://github.com/slyrus/cffi/commit/1cfc67dd367fc669875ce55d46c29e7d16b0c7c3>
comment:17 Changed 5 years ago by
Priority: | major → blocker |
---|
Ensure that we get this mess sorted out for next release.
comment:18 follow-up: 19 Changed 5 years ago by
Replying to Evenson Not Org:
I fixed the ABCL varargs thing while I was at it
<https://github.com/slyrus/cffi/commit/1cfc67dd367fc669875ce55d46c29e7d16b0c7c3>
Any reason we don't get a pull request in for CFFI for this?
comment:19 Changed 5 years ago by
Replying to Evenson Not Org:
Replying to Evenson Not Org:
I fixed the ABCL varargs thing while I was at it
<https://github.com/slyrus/cffi/commit/1cfc67dd367fc669875ce55d46c29e7d16b0c7c3>
Any reason we don't get a pull request in for CFFI for this?
With this patch, 1.6.0-dev
compiled carefully with OpenJDK_64-Bit_Server_VM-AdoptOpenJDK-1.8.0_232-b09
on x86_64-Mac_OS_X-10.14.6
doesn't seem to SEGV, although the CL+SSL.TEST
invocation has some disturbing complaints on uninitialized memory in the Block IO (BIO) routines.
CL-USER> (fiveam:run-all-tests) Running test suite CL+SSL Running test suite CL+SSL.BADSSL-COM Running test WRONG.HOST ..Xf Running test UNTRUSTED-ROOT . Running test EXPIRED ..Xf Running test SELF-SIGNED . Running test COMPAT-OPENSSL-VERSION . Running test suite CL+SSL.VERIFY-HOSTNAME Running test VERIY-HOSTNAME-SUCCESS .............. Running test VERIFY-GOOGLE-CERT-WITHOUT-DNS . Running test VERIFY-GOOGLE-CERT-TELETEX-STRING . Running test VERIFY-GOOGLE-CERT-PRINTABLE-STRING . Running test VERIFY-GOOGLE-CERT-BMP-STRING . Running test VERIFY-GOOGLE-CERT-DNS-WILDCARD . Running test VERIFY-GOOGLE-CERT . Running test VERIFY-GOOGLE-CERT-UNIVERSAL-STRING . Running test VERIFY-HOSTNAME-FAIL ......... Running test SANITY-CHECK.1 . Running test SANITY-CHECK.2 . Running test suite CL+SSL.SNI Did 41 checks. Pass: 39 (95%) Skip: 0 ( 0%) Fail: 2 ( 4%) Failure Details: -------------------------------- EXPIRED []: Unexpected Error: #<CL+SSL::SSL-ERROR-SSL {AF89B4E}> A failure in the SSL library occurred on handle #<com.sun.jna.Pointer native@0x7fae7eeae400 {63DDE5C2}> (return code: 1). SSL error queue: error:20080078:BIO routines:bio_write_intern:uninitialized .. -------------------------------- -------------------------------- WRONG.HOST []: Unexpected Error: #<CL+SSL::SSL-ERROR-SSL {255828B}> A failure in the SSL library occurred on handle #<com.sun.jna.Pointer native@0x7fae7b045400 {1E7233FF}> (return code: 1). SSL error queue: error:20080078:BIO routines:bio_write_intern:uninitialized .. --------------------------------
The SEGV seems to be definitely happening around the use openjdk11 to compile or run the code. My "Heisenbugs" were presumably the result of mixing the fasls from openjdk11 output.
comment:21 Changed 5 years ago by
For better or worse, it fails on JDK 8 for me too now on my linux box.
comment:22 Changed 5 years ago by
charmon has submitted an upstream pull for CFFI <https://github.com/cffi/cffi/pull/149>
comment:23 Changed 5 years ago by
I'm still suffering from the feeling that we are dealing with some sort of Heisenbug around the memory allocation/initialization strategy of openssl 1.1.
Running the cl+ssl.test
suite against released abcl-1.5.0 binaries which uses jna-4.2.2 segfaults on macOS, has the two test failures with bio_write_intern:uninitialized
on Linux, and segfaults on freebsd. And sometimes abcl-1.6.0-dev seems to work on macOS, but I can't reliably reproduce and/or isolate why exactly.
The space of potential variables here is large:
- ABCL version (using 1.5.0 binaries, or 1.6.0-dev compiled from HEAD)
- OS (macOS fails "sometimes", Linux "sometimes", has always failed on FreeBSD, Windows untested)
- JVM platform (testing with openjdk8 and openjdk11 doesn't seem to make a difference)
- OpenSSL version (abcl-1.5.0 was released when openssl 1.0.x/1.1.y were in use)
- jna version (abcl-1.5.0 used jna-4.2.2; abcl-1.6.0-dev has jna-5.5.0 but "downgrading" to jna-4.2.2 doesn't seem to show a difference)
- CFFI version (presumably better with the varargs patch)
My current strategy towards working on this:
- Get CI artifacts into source tree so we can run the tests in a much more controlled manner. I will work on using Github/Travis? as the first target, as that seems easiest for me at the moment
- Figure out how to run the JVM under GDB so that we can inspect the smashed stack, and hopefully set a breakpoint before it happens. Not sure if this is entirely possible in 2019 without a whole lot of elbow grease.
comment:24 Changed 5 years ago by
One other variable, or at least value to consider along the openssl axis, is +/- libressl.
comment:25 Changed 5 years ago by
Hmm. Both openjdk8 and openjdk11 work on the xenial distribution available to <https://travis-ci.org/easye/abcl/jobs/613567826>
Maybe somehow dealing with "static" memory initialization if openssl is used more than once? That shouldn't be able to happen, but maybe multiple JVMs are somehow keeping memory around?
comment:27 Changed 5 years ago by
Under one of the Travis builds <https://travis-ci.org/easye/abcl/jobs/616239946>, I notice
Running test suite CL+SSL.BADSSL-COM # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007fff76853e50, pid=10069, tid=0x0000000000004a03 # # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09) # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode bsd-amd64 compressed oops) # Problematic frame: # C [libssl.35.dylib+0x3be50] SSL_CTX_set_default_verify_paths+0x10
comment:28 Changed 5 years ago by
Description: | modified (diff) |
---|
comment:29 Changed 5 years ago by
With the most recent abcl, I see the following error:
# A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007fe4345f82cd, pid=15459, tid=15533 # # JRE version: OpenJDK Runtime Environment (11.0.6+10) (build 11.0.6+10-post-Ubuntu-1ubuntu119.10.1) # Java VM: OpenJDK 64-Bit Server VM (11.0.6+10-post-Ubuntu-1ubuntu119.10.1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64) # Problematic frame: # C [libcrypto.so.1.1+0xad2cd] BIO_free+0x6d # # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P" (or dumping to /home/sly/src/abcl/core.15459) # # An error report file with more information is saved as: # /home/sly/src/abcl/hs_err_pid15459.log # # If you would like to submit a bug report, please visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-lts # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Process inferior-lisp<1> aborted (core dumped)
comment:31 Changed 5 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
According to <https://github.com/cl-plus-ssl/cl-plus-ssl/issues/72>, the BIO routines work for no Common Lisp implementation past OpenSSL 1.1.1
<https://github.com/cl-plus-ssl/cl-plus-ssl/pull/97> contains a fix for CL+SSL which seems to work on ABCL, so provisionally closing this ticket in the hopes that this is indeed fixed.
comment:32 Changed 5 years ago by
Accepted in master cl-plus-ssl <https://github.com/cl-plus-ssl/cl-plus-ssl/pull/98/files>.
Awaiting Quicklisp distribution, Ultralisp et.al
comment:33 Changed 5 years ago by
Milestone: | 1.6.2 → 1.7.0 |
---|
Under
"1.6.0-dev" "OpenJDK_64-Bit_Server_VM-Oracle_Corporation-1.8.0_212-b04""amd64-FreeBSD-12.0-RELEASE-p3"
I get a segfault, so something is definitely wonky here.
failing in libcrypto.so.111:BIO_free_all() for me