numeric value and multivalue fields support #22

Open
wants to merge 27 commits into
base: master
c88daec
supports numeric fields (insert and query), multi-valued fields.
yxzhang Sep 2, 2013
456f6db
change version to 0.5
yxzhang Sep 2, 2013
3961356
add numeric value support in delete function
yxzhang Sep 2, 2013
5113dc1
add tests and about numeric values.
yxzhang Sep 2, 2013
42370c7
add notes about multivalued field
yxzhang Sep 2, 2013
e98d69e
add eclipse project settings for eclipse user importing easilly
yxzhang Sep 2, 2013
2db8080
add multivalued filed search examples in README
yxzhang Sep 2, 2013
e902c90
add schema-hints for numberic types and field index type
yxzhang Sep 8, 2013
7defb66
version to 0.6.0
yxzhang Sep 8, 2013
66cc5d6
supports sort
yxzhang Sep 8, 2013
3148f0a
add clojure syntax highlighting in readme
yxzhang Sep 8, 2013
c5f8e21
correct wiki syntax
yxzhang Sep 8, 2013
7275b04
update readme
yxzhang Sep 8, 2013
c8ca096
fix bug about lost namespace of "*schema-hints*" in readme
yxzhang Sep 9, 2013
fbe0353
parallel indexing large number of data and schema as index's metadata
yxzhang Sep 10, 2013
e349d90
format readme to avoid too long rows
yxzhang Sep 10, 2013
fe08e55
when parallel indexing report the increased swap value instead of deref
yxzhang Sep 10, 2013
b0f260e
in readme: fix error about parallel indexing
yxzhang Sep 10, 2013
4901919
negitave number search support
yxzhang Sep 11, 2013
3d318ba
simplify schema definition in readme
yxzhang Sep 11, 2013
fd6c766
support untokenized field's query just like {:title "A.B"}
yxzhang Sep 11, 2013
9bf827e
fix bug for float range query
yxzhang Sep 11, 2013
596fb65
ver to 0.8.6
yxzhang Sep 11, 2013
bc5201c
Supports to upsert and only load certain stored fields and fast collect
yxzhang Sep 13, 2013
d6a7d53
fix version in readme
yxzhang Sep 13, 2013
9b29eeb
fix NPE on jdk6
yxzhang Sep 13, 2013
5d867e6
fix bug about numeric value stored but not indexed
yxzhang Sep 13, 2013
36 changes: 36 additions & 0 deletions .classpath
@@ -0,0 +1,36 @@
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
<classpathentry kind="src" path="src"/>
<classpathentry kind="src" path="test">
<attributes>
<attribute name="optional" value="true"/>
</attributes>
</classpathentry>
<classpathentry kind="src" path="dev-resources">
<attributes>
<attribute name="optional" value="true"/>
</attributes>
</classpathentry>
<classpathentry kind="src" path="resources">
<attributes>
<attribute name="optional" value="true"/>
</attributes>
</classpathentry>
<classpathentry kind="lib" path="target/classes">
<attributes>
<attribute name="optional" value="true"/>
</attributes>
</classpathentry>
<classpathentry exported="true" kind="con" path="ccw.LEININGEN_CONTAINER">
<attributes>
<attribute name="org.eclipse.jdt.launching.CLASSPATH_ATTR_LIBRARY_PATH_ENTRY" value="clucy/target/native/linux/x86_64"/>
</attributes>
</classpathentry>
<classpathentry exported="true" kind="lib" path="classes">
<attributes>
<attribute name="optional" value="true"/>
</attributes>
</classpathentry>
<classpathentry kind="output" path="bin"/>
</classpath>
3 changes: 3 additions & 0 deletions .gitignore
@@ -5,3 +5,6 @@ clucy*.jar
pom.xml
pom.xml.asc
.lein-failures
/bin
/.settings/ccw.repl.cmdhistory.prefs
/.settings
35 changes: 35 additions & 0 deletions .project
@@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<projectDescription>
<name>clucy</name>
<comment></comment>
<projects>
</projects>
<buildSpec>
<buildCommand>
<name>ccw.builder</name>
<arguments>
</arguments>
</buildCommand>
<buildCommand>
<name>ccw.leiningen.builder</name>
<arguments>
</arguments>
</buildCommand>
<buildCommand>
<name>org.eclipse.jdt.core.javabuilder</name>
<arguments>
</arguments>
</buildCommand>
<buildCommand>
<name>org.eclipse.wst.common.project.facet.core.builder</name>
<arguments>
</arguments>
</buildCommand>
</buildSpec>
<natures>
<nature>org.eclipse.wst.common.project.facet.core.nature</nature>
<nature>org.eclipse.jdt.core.javanature</nature>
<nature>ccw.leiningen.nature</nature>
<nature>ccw.nature</nature>
</natures>
</projectDescription>
3 changes: 2 additions & 1 deletion .travis.yml
@@ -3,6 +3,7 @@ language: clojure
lein: lein2

jdk:
- oraclejdk6
- openjdk6
- openjdk7
- oraclejdk7
- oraclejdk7
228 changes: 212 additions & 16 deletions README.md
@@ -1,72 +1,268 @@
Clucy
ZClucy
=====

[![Build Status](https://secure.travis-ci.org/weavejester/clucy.png?branch=master)](http://travis-ci.org/weavejester/clucy)
[![Build Status](https://secure.travis-ci.org/yxzhang/clucy.png?branch=master)](http://travis-ci.org/yxzhang/clucy)

ZClucy is a fork of Clucy, a Clojure interface to [Lucene](http://lucene.apache.org/).
ZClucy adds the following features:

1. Supports numeric values (int, long, float, double).
1. Supports multivalued fields.
1. Supports sorting results.
1. Supports parallel indexing of large amounts of data.
1. Supports a schema stored as the index's metadata.
1. Supports upsert (insert documents, or update them if they already exist).
1. Supports loading only certain stored fields, and fast collection of field values for statistics (e.g. sum, avg).

Clucy is a Clojure interface to [Lucene](http://lucene.apache.org/).

Installation
------------

To install Clucy, add the following dependency to your `project.clj`
file:

[clucy "0.4.0"]
```clojure
[zclucy "0.9.2"]
```

Usage
-----

To use Clucy, first require it:

```clojure
(ns example
(:require [clucy.core :as clucy]))
```

Then create an index. You can use `(memory-index)`, which stores the search
index in RAM, or `(disk-index "/path/to/a-folder")`, which stores the index in
a folder on disk.

```clojure
(def index (clucy/memory-index))
```

Next, add Clojure maps to the index:

```clojure
(clucy/add index
{:name "Bob", :job "Builder"}
{:name "Donald", :job "Computer Scientist"})
```

You can remove maps just as easily:

```clojure
(clucy/delete index
{:name "Bob", :job "Builder"})
```

Once maps have been added, the index can be searched:

```clojure
user=> (clucy/search index "bob" 10)
({:name "Bob", :job "Builder"})
```

```clojure
user=> (clucy/search index "scientist" 10)
({:name "Donald", :job "Computer Scientist"})
```

You can search and remove all in one step. To remove all of the
scientists...

```clojure
(clucy/search-and-delete index "job:scientist")
```

Manipulate Schema
--------------

By default, every field is a string that is stored, indexed, and analyzed, and that stores norms. You can customise this with a schema:

```clojure
(def people-schema {:name {:type "string"} :age {:type "int" }})
```

```clojure
(binding [clucy/*schema-hints* people-schema]
;.... do some adding
;.....do some query
)
```

Or you can attach a schema to the index when you create it:

```clojure
(def index (clucy/memory-index people-schema))
```

Then you no longer need to bind `*schema-hints*`. The following two snippets produce the same result:

```clojure
(clucy/add ....)
(clucy/add ....)
```

```clojure
(binding [clucy/*schema-hints* people-schema]
(clucy/add ....)
(clucy/add ....)
)
```
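
As a minimal sketch of how a dynamic var behaves under `binding`, the snippet below uses a hypothetical `*hints*` var and a hypothetical `field-type` helper. Neither is part of ZClucy; they only illustrate why code inside the `binding` body sees the schema while code outside it falls back to the default.

```clojure
;; Hypothetical stand-in for a dynamic var such as clucy/*schema-hints*.
(def ^:dynamic *hints* {})

(defn field-type
  "Looks up a field's declared type in the currently bound hints,
  falling back to \"string\" when the field is not declared."
  [field]
  (get-in *hints* [field :type] "string"))

(def people-schema {:name {:type "string"} :age {:type "int"}})

;; No binding in effect: the default applies.
(def outside (field-type :age))

;; Inside the binding body the schema is visible.
(def inside (binding [*hints* people-schema]
              (field-type :age)))
```

Here `outside` is `"string"` and `inside` is `"int"`. The binding is thread-local and ends when the body returns, which is why attaching the schema to the index itself is the more convenient option.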

With this schema, `name` is still a string that is stored, indexed, and analyzed, and that stores norms, while `age` is an int that is not analyzed and stores no norms.


Numeric Types
--------------

You can add maps with numeric values to the index:

```clojure
(def people-schema {:name {:type "string"} :age {:type "int"}})
```

```clojure
(binding [clucy/*schema-hints* people-schema]
(clucy/add index
{:name "Bob", :age (int 20)}
{:name "Donald", :age (int 35)}))
```

Once maps have been added, the index can be searched:

```clojure
user=> (binding [clucy/*schema-hints* people-schema]
(clucy/search index "age:20" 10))
({:age 20, :name "Bob"})
```

Or run a range query:

```clojure
user=> (binding [clucy/*schema-hints* people-schema]
(clucy/search index "age:[32 TO 35]" 10))
({:age 35, :name "Donald"})
```

A numeric type can be one of int, long, double, or float.

Multivalued Fields
--------------

Storing Fields
You can use a Clojure collection to hold the values of a multivalued field, e.g.:

```clojure
(clucy/add index
{:name "Bob", :books ["Clojure Programming" "Clojure In Action"]})
```

```clojure
user=> (clucy/search index "books:action" 10)
({:name "Bob", :books ["Clojure Programming" "Clojure In Action"]})
```

Sort Results
--------------
First, add some documents with a defined schema:

```clojure
(def people-schema {:name {:type "string"} :age {:type "int" }})

(binding [clucy/*schema-hints* people-schema]
(clucy/add index
{:name "Bob", :age (int 20)}
{:name "Donald", :age (int 35)}))
```

Then you can sort the results when searching:

```clojure
user=> (binding [clucy/*schema-hints* people-schema]
(clucy/search index "*:*" 10 :sort-by "age desc"))
({:age 35, :name "Donald"} {:age 20, :name "Bob"})
```

You can sort by several fields:

```clojure
(binding [clucy/*schema-hints* people-schema]
(clucy/search index "*:*" 10 :sort-by "age desc, name asc"))
```
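
To make the intended ordering concrete, here is a sketch in plain Clojure with no index involved. It assumes that `"age desc, name asc"` means "higher age first, ties broken by name ascending". The comparator name `by-age-desc-name-asc` and the extra `"Alice"` record are hypothetical, and this says nothing about how Lucene or ZClucy implements sorting internally.

```clojure
(def people [{:name "Bob"    :age 20}
             {:name "Donald" :age 35}
             {:name "Alice"  :age 35}])

(defn by-age-desc-name-asc
  "Comparator: higher :age first; ties broken by :name ascending."
  [a b]
  (let [c (compare (:age b) (:age a))]   ; arguments swapped for descending
    (if (zero? c)
      (compare (:name a) (:name b))
      c)))

(def sorted-people (sort by-age-desc-name-asc people))
;; names in order: "Alice", "Donald", "Bob"
```

The two 35-year-olds come first, ordered by name, followed by Bob.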

Or sort by document number (index order):

```clojure
(binding [clucy/*schema-hints* people-schema]
(clucy/search index "*:*" 10 :sort-by "$doc asc"))
```

Or sort by document score (relevance):

By default all fields in a map are stored and indexed. If you would
like more fine-grained control over which fields are stored and index,
add this to the meta-data for your map.
```clojure
(binding [clucy/*schema-hints* people-schema]
(clucy/search index "*:*" 10 :sort-by "$score asc"))
```

(with-meta {:name "Stever", :job "Writer", :phone "555-212-0202"}
{:phone {:stored false}})
Parallel indexing
--------------------

When you want to index a large amount of data, such as a large text file with one record per line, use `padd` instead of `add`:

```clojure
(with-open [r (clojure.java.io/reader file)]
(let [stime (System/currentTimeMillis)
reporter (fn [n]
(when (= (mod n 100000) 0) ; print progress every 100K records
(println n " cost:"(- (System/currentTimeMillis) stime)))) ]
(clucy/padd index reporter
(map
#(let [row (clojure.string/split % #"\s+")]
{:id (row 0), :name (row 1) })
(line-seq r)))))
```
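
The row-parsing step and the progress reporter from the example above can be tried in isolation, without an index. The sketch below assumes, as the example suggests, that the reporter is called with the number of documents indexed so far. The names `parse-row` and `every-nth-reporter` are hypothetical helpers, not part of ZClucy.

```clojure
(require '[clojure.string :as str])

(defn parse-row
  "Splits one whitespace-separated line into a map with :id and :name."
  [line]
  (let [row (str/split line #"\s+")]
    {:id (row 0), :name (row 1)}))

(defn every-nth-reporter
  "Returns a reporter that calls `report!` only when n is a multiple of step."
  [step report!]
  (fn [n]
    (when (zero? (mod n step))
      (report! n))))

;; Parse two sample lines.
(def rows (map parse-row ["1 Tom" "2 Jack"]))

;; Record which counts trigger a report, using a step of 2 for brevity.
(def seen (atom []))
(def reporter (every-nth-reporter 2 #(swap! seen conj %)))
(doseq [n (range 1 6)] (reporter n))
;; @seen now holds [2 4]
```

Note that `parse-row` leaves `:id` as a string. If the schema declares `:id` as a numeric type, the value would need converting before it is added.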

Upsert
--------------------
Before using upsert, you must define the ID fields (like a primary key) in the schema. For example, to use `id` as the unique field:

```clojure
(def index (clucy/disk-index "mypath" {:_id [:id], :id {:type "int"}, :name {:type "string"}}))

;There's no record so just insert it.
(clucy/upsert index {:id 1 :name "Tom"})

=>(clucy/search index "*:*" 10)
;({:id 1 :name "Tom"})

;now update Tom to Jack
(clucy/upsert index {:id 1 :name "Jack"})

=>(clucy/search index "*:*" 10)
;({:id 1 :name "Jack"})
```
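
The insert-or-replace behaviour shown above can be modelled with a plain map keyed by the ID field. This is only a sketch of the semantics, under the assumption that a document with an existing ID fully replaces the old one. The `upsert` function below is a hypothetical stand-in, not `clucy/upsert`.

```clojure
(defn upsert
  "Returns `docs` (a map of id -> document) with `doc` stored under its :id,
  replacing any existing document that has the same :id."
  [docs doc]
  (assoc docs (:id doc) doc))

(def store
  (-> {}
      (upsert {:id 1 :name "Tom"})     ; no record yet, so this inserts
      (upsert {:id 1 :name "Jack"})))  ; same :id, so this replaces Tom

(def all-docs (vals store))
;; all-docs is ({:id 1, :name "Jack"})
```

Only one document remains after the second call, which matches the search results in the example above.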


Load only certain stored fields and quickly collect field values for statistics (e.g. sum, avg)
--------------------
```clojure
(let [index (memory-index people-schema)
avg (atom 0) sum (atom 0) max (atom 0) min (atom 0)]
;Do some insert here ........
;.......
(search index "age:[34 TO 48]" 100 :fields [:age] ; only load age field
:doc-collector (fn [{age :age} i total]
(println "i:" i ",total: " total ",age:" age)
(swap! sum + age)
(when (or (= i 0) (> @min age)) (reset! min age))
(when (or (= i 0) (> age @max)) (reset! max age))
(when (= i (dec total))
(reset! avg (/ @sum total))))))
```
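
The collector logic can be exercised without an index by calling it by hand. The sketch below assumes, as the example above implies, that the collector receives each document, its zero-based position `i`, and the `total` number of hits. `make-stats-collector` is a hypothetical helper. It keeps all statistics in one atom and uses the keys `:lo` and `:hi` to avoid shadowing `min` and `max`.

```clojure
(defn make-stats-collector
  "Returns [collector stats]. `stats` is an atom holding the running
  :sum, :lo and :hi of the :age field, plus :avg after the last document."
  []
  (let [stats (atom {:sum 0 :lo nil :hi nil :avg nil})
        collector
        (fn [{age :age} i total]
          (swap! stats
                 (fn [{:keys [sum lo hi] :as s}]
                   (let [new-sum (+ sum age)]
                     (assoc s
                            :sum new-sum
                            :lo (if lo (min lo age) age)
                            :hi (if hi (max hi age) age)
                            ;; stays nil until the last document is seen
                            :avg (when (= i (dec total))
                                   (/ new-sum total)))))))]
    [collector stats]))

;; Drive the collector by hand with three sample documents.
(def result
  (let [[collect stats] (make-stats-collector)
        ages [35 40 48]]
    (doseq [[i age] (map-indexed vector ages)]
      (collect {:age age} i (count ages)))
    @stats))
;; result is {:sum 123, :lo 35, :hi 48, :avg 41}
```

With integer ages, `/` returns a Ratio whenever the sum is not evenly divisible by the total. Wrap the division in `double` if a floating-point average is wanted.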

When the map above is saved to the index, the phone field will be
available for searching but will not be part of map in the search
results. This example is pretty contrived, this makes more sense when
you are indexing something large (like the full text of a long
article) and you don't want to pay the price of storing the entire
text in the index.

Default Search Field
--------------------
15 changes: 8 additions & 7 deletions project.clj
@@ -1,15 +1,16 @@
(defproject clucy "0.4.0"
(defproject zclucy "0.9.2"
:description "A Clojure interface to the Lucene search engine"
:url "http://github/weavejester/clucy"
:url "http://github.com/yxzhang/clucy"
:dependencies [[org.clojure/clojure "1.4.0"]
[org.apache.lucene/lucene-core "4.2.0"]
[org.apache.lucene/lucene-queryparser "4.2.0"]
[org.apache.lucene/lucene-analyzers-common "4.2.0"]
[org.apache.lucene/lucene-highlighter "4.2.0"]]
[org.apache.lucene/lucene-core "4.4.0"]
[org.apache.lucene/lucene-queryparser "4.4.0"]
[org.apache.lucene/lucene-analyzers-common "4.4.0"]
[org.apache.lucene/lucene-highlighter "4.4.0"]]
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"}
:profiles {:1.4 {:dependencies [[org.clojure/clojure "1.4.0"]]}
:1.5 {:dependencies [[org.clojure/clojure "1.5.0"]]}
:1.6 {:dependencies [[org.clojure/clojure "1.6.0-master-SNAPSHOT"]]}}
:codox {:src-dir-uri "http://github/weavejester/clucy/blob/master"
:warn-on-reflection true
:codox {:src-dir-uri "http://github.com/yxzhang/clucy/blob/master"
:src-linenum-anchor-prefix "L"})