Opened 6 months ago

Last modified 10 days ago

#464 assigned defect

drakma / cl+ssl failure

Reported by: charmon Owned by: Evenson Not Org
Priority: blocker Milestone: 1.6.1
Component: libraries Version: 1.6.0-dev
Keywords: cl+ssl, ssl, drakma Cc:
Parent Tickets:

Description (last modified by Evenson Not Org)

c.f. <https://github.com/armedbear/abcl/issues/127>

This is probably a cl+ssl/cffi issue rather than a drakma issue, but when I try to do, e.g.,

(drakma:http-request "https://common-lisp.net/")

I get the following error:

A failure in the SSL library occurred on handle #<com.sun.jna.Pointer native@0x7fa96089a760 {41501C8A}> (return code: 1).
SSL error queue:
error:20080078:BIO routines:bio_write_intern:uninitialized
   [Condition of type CL+SSL::SSL-ERROR-SSL]

Restarts:
 0: [RETRY] Retry SLIME REPL evaluation request.
 1: [*ABORT] Return to SLIME's top level.
 2: [ABORT] Abort thread.

Backtrace:
  0: (#<FUNCTION {63F35033}> #<CL+SSL::SSL-ERROR-SSL {5769AA74}> #<FUNCTION {63F35033}>)
  1: (APPLY #<FUNCTION {63F35033}> (#<CL+SSL::SSL-ERROR-SSL {5769AA74}> #<FUNCTION {63F35033}>))
  2: (SYSTEM::RUN-HOOK SYSTEM::*INVOKE-DEBUGGER-HOOK* #<CL+SSL::SSL-ERROR-SSL {5769AA74}> #<FUNCTION {63F35033}>)
  3: (INVOKE-DEBUGGER #<CL+SSL::SSL-ERROR-SSL {5769AA74}>)
  4: (ERROR CL+SSL::SSL-ERROR-SSL :HANDLE #<com.sun.jna.Pointer native@0x7fa96089a760 {41501C8A}> :RET 1 ...)
  5: (CL+SSL:MAKE-SSL-CLIENT-STREAM #<TWO-WAY-STREAM {50E6169D}> :VERIFY NIL :HOSTNAME "common-lisp.net" ...)
  6: (#<FUNCTION {4A462CFF}>)
  7: (CL+SSL::CALL-WITH-GLOBAL-CONTEXT #<com.sun.jna.Pointer native@0x7fa9609829d0 {BB3CD71}> NIL #<FUNCTION {4A462CFF}>)
  8: (DRAKMA::MAKE-SSL-STREAM #<TWO-WAY-STREAM {50E6169D}> :HOSTNAME "common-lisp.net" :CERTIFICATE NIL ...)
  9: (DRAKMA:HTTP-REQUEST "https://common-lisp.net/")
 10: (SYSTEM::%EVAL (DRAKMA:HTTP-REQUEST "https://common-lisp.net/"))
 11: (EVAL (DRAKMA:HTTP-REQUEST "https://common-lisp.net/"))

Subtickets

Attachments (3)

jna-5.3.1.patch (970 bytes) - added by Evenson Not Org 5 months ago.
Use latest stable JNA as jna-5.3.1
jna-5.4.0.patch (1.2 KB) - added by Evenson Not Org 4 months ago.
Patch to use jna-5.4.0
asdf-test.pach (491 bytes) - added by Evenson Not Org 4 months ago.
Patch asdf to invoke FIVEAM tests

Download all attachments as: .zip

Change History (31)

comment:1 Changed 5 months ago by Evenson Not Org

Component: compilerlibraries
Milestone: 1.6.0

Under
"1.6.0-dev" "OpenJDK_64-Bit_Server_VM-Oracle_Corporation-1.8.0_212-b04""amd64-FreeBSD-12.0-RELEASE-p3"

I get a segfault, so something is definitely wonky here.

; Loaded /usr/home/mevenson/.slime/fasl/2.23/armedbear-1.6.0-dev-unix-x86-64/contrib/swank-\
asdf.abcl (0.319 seconds)                                                                   
#                                                                                           
# A fatal error has been detected by the Java Runtime Environment:                          
#                                                                                           
#  SIGSEGV (0xb) at pc=0x000000096ad86f60, pid=55874, tid=0x0000000000018b5d                
#                                                                                           
# JRE version: OpenJDK Runtime Environment (8.0_212-b04) (build 1.8.0_212-b04)              
# Java VM: OpenJDK 64-Bit Server VM (25.212-b04 mixed mode bsd-amd64 compressed oops)       
# Problematic frame:                                                                        
# C  0x000000096ad86f60                                                                     
#                                                                                           
# Core dump written. Default location: /usr/home/mevenson/work/ansi-test/java.core          
#                                                                                           
# An error report file with more information is saved as:                                   
# /usr/home/mevenson/work/ansi-test/hs_err_pid55874.log                                     
#                                                                                           
# If you would like to submit a bug report, please visit:                                   
#   http://bugreport.java.com/bugreport/crash.jsp                                           
# The crash happened outside the Java Virtual Machine in native code.                       
# See problematic frame for where to report the bug.                                        
#                                                                                           
                                                                                            
Process inferior-lisp abort trap (core dumped)                                              
                                                  

failing in libcrypto.so.111:BIO_free_all() for me

Stack: [0x00007fffdcfa9000,0x00007fffdd0a9000],  sp=0x00007fffdd0a1488,  free space=993k                                          
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)                                                   
C  0x0000000c988027c0                                                                                                             
C  [libcrypto.so.111+0x25a405]  BIO_free_all+0x25                                                                                 
C  [libssl.so+0x711b2]  SSL_free+0xd2                                                                                             
C  [jna3590526486745548170.tmp+0x12be4]  float+0x4c                                                                               
C  0x00007fffdd0a1f50       
Last edited 4 months ago by Evenson Not Org (previous) (diff)

comment:2 Changed 5 months ago by Evenson Not Org

Update to latest stable jna-5.3.1 locally, then testing OpenJDK8 shows failure to be platform dependent.

  1. FreeBSD -- segmentation violation; JVM crashes; no Lisp trace
  1. macOS -- Null pointer exception with the following stacktrace
  8: (JCALL-RAW #<java.lang.reflect.Method public java.lang.Object com.sun..... {6E71BAF0}> NIL #<java.lang.Class class com.sun.jna.Pointer {66A9D1AB}> #<jarray [Ljava.lang.Object;@2ffedc49 {7D7AEA62}>)
      Locals:
        EXTENSIONS::|method-ref| = #<method public java.lang.Object com.sun.jna.Function.invoke(java.lang.Class,java.lang.Object[])>
        EXTENSIONS::|instance|#1 = NIL
        EXTENSIONS::|&rest|#2 = #<java class com.sun.jna.Pointer>
        EXTENSIONS::|args|#3 = #<jarray [Ljava.lang.Object;@2ffedc49 {7D7AEA62}>
  9: (CL+SSL:ENSURE-INITIALIZED)
 10: (CL+SSL:MAKE-CONTEXT :VERIFY-DEPTH 10 :VERIFY-MODE 0 :VERIFY-CALLBACK ...)

Changed 5 months ago by Evenson Not Org

Attachment: jna-5.3.1.patch added

Use latest stable JNA as jna-5.3.1

comment:3 Changed 5 months ago by Evenson Not Org

Owner: set to Evenson Not Org
Status: newassigned

comment:4 Changed 5 months ago by Evenson Not Org

Somewhere along the line, abcl-1.6.0-dev has picked up 7 unexpected failures in the CFFI tests:

7 unexpected failures: FUNCALL.VARARGS.NOSTDLIB,                               
         FUNCALL.VARARGS.CHAR, FUNCALL.VARARGS.INT, FUNCALL.VARARGS.LONG,               
         FUNCALL.VARARGS.DOUBLE, FUNCALL.VARARGS.STRING, TEST-ASDF-LOAD      
Last edited 5 months ago by Evenson Not Org (previous) (diff)

comment:5 Changed 5 months ago by charmon

FWIW, going back in time to ABCL

fd87311d * @ Update to jna-4.2.2

and CFFI

600f440 * @ Revert "Mutex on hash table for cffi-libffi"

and running (asdf:test-system 'cffi) yields:

3 out of 308 total tests failed: FSBV.WFO, FSBV.MAKEPAIR.1, FSBV.MAKEPAIR.2.No unexpected failures.

while running that version of ABCL with current cffi HEAD yields:

10 out of 325 total tests failed: FUNCALL.VARARGS.NOSTDLIB, FUNCALL.VARARGS.CHAR, FUNCALL.VARARGS.INT, FUNCALL.VARARGS.LONG, FUNCALL.VARARGS.DOUBLE, FUNCALL.VARARGS.STRING, FSBV.WFO, FSBV.MAKEPAIR.1, FSBV.MAKEPAIR.2, TEST-ASDF-LOAD.
7 unexpected failures: FUNCALL.VARARGS.NOSTDLIB, FUNCALL.VARARGS.CHAR, FUNCALL.VARARGS.INT, FUNCALL.VARARGS.LONG, FUNCALL.VARARGS.DOUBLE, FUNCALL.VARARGS.STRING, TEST-ASDF-LOAD.

So I think some things got broken along the way in CFFI. I can bisect.

comment:6 Changed 5 months ago by charmon

Looks like this is the culprit:

Head:     0d2b3c0 tests: funcall: call sprintf with foreign-funcall-varargs
Tags:     v0.19.0 (4), v0.20.0 (25)

0d2b3c0067529a9862c4c967158b75b040d1d8f8 is the first bad commit
commit 0d2b3c0067529a9862c4c967158b75b040d1d8f8
Author: Daniel Kochmanski <daniel@turtleware.eu>
Date:   Sun Oct 1 14:43:17 2017 +0200
    tests: funcall: call sprintf with foreign-funcall-varargs
    
    This function takes variable number of arguments and we have
    abstraction for that.
:040000 040000 bbdb0b48f2d5d8fc9fde1a01b8955f88e83066d2 a5825c1e6228c3f9a7a0e41ebbe873c5c1622fdb M	tests

Bisect Rest (1)
0d2b3c0 * bad @ tests: funcall: call sprintf with foreign-funcall-varargs

comment:7 Changed 5 months ago by charmon

Ah, but of course that's just exposing the broken behavior from the tests, so who knows when the actual breakage came in.

comment:8 Changed 5 months ago by charmon

Ok, so I managed to get things to build with current ABCL, cffi, and drakma, and mostly-current cl+ssl. The problem is that pointing cl+ssl at libssl.so.1.1 breaks things:

commit fde3b207cb61f5e23a40f75a733c4a2b31a69dd1
Author: Anton Vodonosov <avodonosov@yandex.ru>
Date:   Mon Feb 4 01:20:37 2019 +0000
    Actually use Openssl-1.1, if available.

Is the offending commit. If we go back to openssl-1.0, then things work again. So there are probably multiple problems. Why does openssl-1.1 break things? Why does this trigger the various NPEs/segfaults seen above? etc...

Changed 4 months ago by Evenson Not Org

Attachment: jna-5.4.0.patch added

Patch to use jna-5.4.0

comment:9 Changed 4 months ago by Evenson Not Org

#+begin_example
CL+SSL> (default-ssl-method)
TLS-METHOD
#+end_example

#+begin_src lisp

(setf *ssl-global-method* (funcall (or method (default-ssl-method))))
(setf *ssl-global-context* (ssl-ctx-new *ssl-global-method*))

#+end_src

* Run the CL+SSL test suites

Invocation via FIVEAM:

#+begin_example
(fiveam:run-all-tests)
#+end_example

Add invocation to ASDF:


(swank:operate-on-system-for-emacs "cl+ssl.test" 'test-op)

Running test suite CL+SSL
 Running test suite CL+SSL.SNI
 Running test suite CL+SSL.VERIFY-HOSTNAME
  Running test VERIFY-GOOGLE-CERT-PRINTABLE-STRING .
  Running test VERIFY-HOSTNAME-FAIL .........
  Running test VERIFY-GOOGLE-CERT-DNS-WILDCARD .
  Running test VERIY-HOSTNAME-SUCCESS ..............
  Running test VERIFY-GOOGLE-CERT-TELETEX-STRING .
  Running test VERIFY-GOOGLE-CERT-UNIVERSAL-STRING .
  Running test VERIFY-GOOGLE-CERT-BMP-STRING .
  Running test VERIFY-GOOGLE-CERT .
  Running test VERIFY-GOOGLE-CERT-WITHOUT-DNS .
 Running test SANITY-CHECK.2 .
 Running test SANITY-CHECK.1 .
 Running test COMPAT-OPENSSL-VERSION .
 Running test suite CL+SSL.BADSSL-COM
  Running test WRONG.HOST ..Xf
  Running test EXPIRED ..Xf
  Running test UNTRUSTED-ROOT .
  Running test SELF-SIGNED . Did 41 checks.
    Pass: 39 (95%)
    Skip: 0 ( 0%)
    Fail: 2 ( 4%)

 Failure Details:
 --------------------------------
 EXPIRED []: 
      Unexpected Error: #<JAVA:JAVA-EXCEPTION java.lang.NullPointerException {59A01E02}>
Java exception 'java.lang.NullPointerException'...
 --------------------------------
 --------------------------------
 WRONG.HOST []: 
      Unexpected Error: #<JAVA:JAVA-EXCEPTION java.lang.NullPointerException {2F1F3092}>
Java exception 'java.lang.NullPointerException'...
 --------------------------------

(:COMPILATION-RESULT NIL T 4.431 NIL NIL)
CL-USER> 

Last edited 4 months ago by Evenson Not Org (previous) (diff)

Changed 4 months ago by Evenson Not Org

Attachment: asdf-test.pach added

Patch asdf to invoke FIVEAM tests

comment:10 Changed 2 months ago by charmon

Here's a minimal varargs test that fails (from the CFFI test suite):

(asdf:test-system "cffi")

(CFFI::%FOREIGN-FUNCALL-VARARGS "sum_double_arbitrary"
                                (:INT 1)
                                (:DOUBLE 3.14d0
                                         :DOUBLE))

comment:11 Changed 2 months ago by charmon

Ok, I'm down a rathole of multiple failures. I've fixed the vararg problem (more on that later). But it looks like the cl+ssl version checking stuff is completely bogus. Somehow, we open:

/usr/local/Cellar/openssl/1.0.2t/lib/libssl.1.0.0.dylib

but then the version checking stuff thinks that we have openssl 1.1 installed and tries to call openssl 1.1 functions and fails.

comment:12 Changed 2 months ago by charmon

So both SBCL and ABCL use the following (on my system):

/usr/local/Cellar/openssl/1.0.2t/lib/libssl.1.0.0.dylib

But on SBCL, I get:

CL-USER> (format nil "~X" (cl+ssl::ssl-eay))
"1000214F"

and on ABCL, I get:

CL-USER> (format nil "~X" (cl+ssl::ssl-eay))
"20000000"

I think this is somehow making the cl+ssl stuff think that we're using post version 1.0 and it tries to call a function that doesn't exist.

comment:13 Changed 2 months ago by charmon

Success! After a fashion anyway... Need to do:

@@ -125,7 +125,7 @@ (defmacro define-crypto-function-ex ((&key since vanished) name-and-options &bod
               *cl+ssl-crypto-foreign-function-names*
               :test 'equal)
      (defcfun-versioned (:since ,since :vanished ,vanished)
-         ,(append name-and-options  #+(and lispworks darwin) '(:library libcrypto))
+         ,(append name-and-options  #+(and (or abcl lispworks) darwin) '(:library libcrypto))
        ,@body)))
 
 (defmacro define-crypto-function (name-and-options &body body)

Will post a PR to cl+ssl.

comment:14 Changed 2 months ago by charmon

Oh, and I fixed the ABCL varargs thing while I was at it, although this was probably a red herring.

comment:15 in reply to:  13 Changed 2 months ago by Evenson Not Org

Replying to charmon:

Will post a PR to cl+ssl.

<https://github.com/cl-plus-ssl/cl-plus-ssl/pull/89>

comment:16 Changed 5 weeks ago by Evenson Not Org

comment:17 Changed 4 weeks ago by Evenson Not Org

Priority: majorblocker

Ensure that we get this mess sorted out for next release.

comment:18 in reply to:  16 ; Changed 3 weeks ago by Evenson Not Org

Replying to Evenson Not Org:

I fixed the ABCL varargs thing while I was at it

<https://github.com/slyrus/cffi/commit/1cfc67dd367fc669875ce55d46c29e7d16b0c7c3>

Any reason we don't get a pull request in for CFFI for this?

comment:19 in reply to:  18 Changed 3 weeks ago by Evenson Not Org

Replying to Evenson Not Org:

Replying to Evenson Not Org:

I fixed the ABCL varargs thing while I was at it

<https://github.com/slyrus/cffi/commit/1cfc67dd367fc669875ce55d46c29e7d16b0c7c3>

Any reason we don't get a pull request in for CFFI for this?

With this patch, 1.6.0-dev compiled carefully with OpenJDK_64-Bit_Server_VM-AdoptOpenJDK-1.8.0_232-b09 on x86_64-Mac_OS_X-10.14.6 doesn't seem to SEGV, although the CL+SSL.TEST invocation has some disturbing complaints on uninitialized memory in the Block IO (BIO) routines.

CL-USER> (fiveam:run-all-tests)

Running test suite CL+SSL
 Running test suite CL+SSL.BADSSL-COM
  Running test WRONG.HOST ..Xf
  Running test UNTRUSTED-ROOT .
  Running test EXPIRED ..Xf
  Running test SELF-SIGNED .
 Running test COMPAT-OPENSSL-VERSION .
 Running test suite CL+SSL.VERIFY-HOSTNAME
  Running test VERIY-HOSTNAME-SUCCESS ..............
  Running test VERIFY-GOOGLE-CERT-WITHOUT-DNS .
  Running test VERIFY-GOOGLE-CERT-TELETEX-STRING .
  Running test VERIFY-GOOGLE-CERT-PRINTABLE-STRING .
  Running test VERIFY-GOOGLE-CERT-BMP-STRING .
  Running test VERIFY-GOOGLE-CERT-DNS-WILDCARD .
  Running test VERIFY-GOOGLE-CERT .
  Running test VERIFY-GOOGLE-CERT-UNIVERSAL-STRING .
  Running test VERIFY-HOSTNAME-FAIL .........
 Running test SANITY-CHECK.1 .
 Running test SANITY-CHECK.2 .
 Running test suite CL+SSL.SNI Did 41 checks.
    Pass: 39 (95%)
    Skip: 0 ( 0%)
    Fail: 2 ( 4%)

 Failure Details:
 --------------------------------
 EXPIRED []: 
      Unexpected Error: #<CL+SSL::SSL-ERROR-SSL {AF89B4E}>
A failure in the SSL library occurred on handle #<com.sun.jna.Pointer native@0x7fae7eeae400 {63DDE5C2}> (return code: 1).
SSL error queue:
error:20080078:BIO routines:bio_write_intern:uninitialized
..
 --------------------------------
 --------------------------------
 WRONG.HOST []: 
      Unexpected Error: #<CL+SSL::SSL-ERROR-SSL {255828B}>
A failure in the SSL library occurred on handle #<com.sun.jna.Pointer native@0x7fae7b045400 {1E7233FF}> (return code: 1).
SSL error queue:
error:20080078:BIO routines:bio_write_intern:uninitialized
..
 --------------------------------

The SEGV seems to be definitely happening around the use openjdk11 to compile or run the code. My "Heisenbugs" were presumably the result of mixing the fasls from openjdk11 output.

Last edited 3 weeks ago by Evenson Not Org (previous) (diff)

comment:20 Changed 3 weeks ago by charmon

Yes, it's definitely SEGV'ing here for me :(

comment:21 Changed 3 weeks ago by charmon

For better or worse, it fails on JDK 8 for me too now on my linux box.

comment:22 Changed 3 weeks ago by Evenson Not Org

charmon has submitted an upstream pull for CFFI <https://github.com/cffi/cffi/pull/149>

comment:23 Changed 3 weeks ago by Evenson Not Org

I'm still suffering from the feeling that we are dealing with some sort of Heisenbug around the memory allocation/initialization strategy of openssl 1.1.

Running the cl+ssl.test suite against released abcl-1.5.0 binaries which uses jna-4.2.2 segfaults on macOS, has the two test failures with bio_write_intern:uninitialized on Linux, and segfaults on freebsd. And sometimes abcl-1.6.0-dev seems to work on macOS, but I can't reliably reproduce and/or isolate why exactly.

The space of potential variables here is large:

  • ABCL version (using 1.5.0 binaries, or 1.6.0-dev compiled from HEAD)
  • OS (macOS fails "sometimes", Linux "sometimes", has always failed on FreeBSD, Windows untested)
  • JVM platform (testing with openjdk8 and openjdk11 doesn't seem to make a difference)
  • OpenSSL version (abcl-1.5.0 was released when openssl 1.0.x/1.1.y were in use)
  • jna version (abcl-1.5.0 used jna-4.2.2; abcl-1.6.0-dev has jna-5.5.0 but "downgrading" to jna-4.2.2 doesn't seem to show a difference)
  • CFFI version (presumably better with the varargs patch)

My current strategy towards working on this:

  1. Get CI artifacts into source tree so we can run the tests in a much more controlled manner. I will work on using Github/Travis? as the first target, as that seems easiest for me at the moment
  1. Figure out how to run the JVM under GDB so that we can inspect the smashed stack, and hopefully set a breakpoint before it happens. Not sure if this is entirely possible in 2019 without a whole lot of elbow grease.
Last edited 3 weeks ago by Evenson Not Org (previous) (diff)

comment:24 Changed 3 weeks ago by charmon

One other variable, or at least value to consider along the openssl axis, is +/- libressl.

comment:25 Changed 3 weeks ago by Evenson Not Org

Hmm. Both openjdk8 and openjdk11 work on the xenial distribution available to <https://travis-ci.org/easye/abcl/jobs/613567826>

[17:02]

Maybe somehow dealing with "static" memory initialization if openssl is used more than once? That shouldn't be able to happen, but maybe multiple JVMs are somehow keeping memory around?

comment:26 Changed 2 weeks ago by Evenson Not Org

Milestone: 1.6.01.6.1

Ticket retargeted after milestone closed

comment:27 Changed 2 weeks ago by Evenson Not Org

Under one of the Travis builds <https://travis-ci.org/easye/abcl/jobs/616239946>, I notice

   Running test suite CL+SSL.BADSSL-COM
   #
   # A fatal error has been detected by the Java Runtime Environment:
   #
   #  SIGSEGV (0xb) at pc=0x00007fff76853e50, pid=10069, tid=0x0000000000004a03
   #
   # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
   # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode bsd-amd64 compressed oops)
   # Problematic frame:

   # C  [libssl.35.dylib+0x3be50]  SSL_CTX_set_default_verify_paths+0x10

Last edited 2 weeks ago by Evenson Not Org (previous) (diff)

comment:28 Changed 10 days ago by Evenson Not Org

Description: modified (diff)
Note: See TracTickets for help on using tickets.