I think it would be nice if the datasplash.api functions had roughly the same semantics as their clojure.core counterparts; otherwise, debugging things at the REPL becomes extremely painful.
For example:
(def data1 {:a (int 1)})
(def data2 {:a (long 1)})
(= data1 data2)
;; => true
(clojure.set/intersection #{data1} #{data2})
;; => #{{:a 1}}
(let [p      (ds/make-pipeline {})
      input1 (ds/generate-input [data1] p)
      input2 (ds/generate-input [data2] p)
      _      (ds/->> :intersect-pipeline
                     (ds/intersect-distinct {:name :intersect} input1 input2)
                     (ds/write-json-file "test-output" {}))]
  (-> (ds/run-pipeline p)
      (ds/wait-pipeline-result)))
The last pipeline produces no results (it does once data2 is changed to {:a (int 1)}). The problem is that whenever a lot of data needs to be compared, intersected, or grouped, every row first has to be made comparable (with something like clojure.walk), which can be very expensive.
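To make the workaround concrete, here is a minimal sketch of the kind of normalization I mean. It assumes the mismatch comes from the boxed Java types (Integer vs Long) rather than from the values themselves; I have not confirmed how datasplash compares elements internally. `normalize-integers` is a hypothetical helper name, not part of datasplash.

```clojure
(require '[clojure.walk :as walk])

(def data1 {:a (int 1)})
(def data2 {:a (long 1)})

;; clojure.core/= compares integers by value, so the maps are equal...
(= data1 data2)

;; ...even though the boxed values are instances of different Java classes.
(map (comp class :a) [data1 data2])

(defn normalize-integers
  "Hypothetical helper: widens every Byte, Short and Integer found in a
  nested data structure to a Long, leaving all other values untouched."
  [row]
  (walk/postwalk
   (fn [x]
     (if (or (instance? Byte x) (instance? Short x) (instance? Integer x))
       (long x)
       x))
   row))

;; After normalization both rows carry the same boxed type.
(map (comp class :a normalize-integers) [data1 data2])
```

Running something like this over every row before the intersect is the full traversal that gets expensive on large inputs.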