Query container #373

Open
wants to merge 27 commits into
base: main
27 commits
12d5fb6
Query contaier
elshize Apr 26, 2020
1d3aa88
Query container parsing
elshize Apr 26, 2020
ce2b720
Merge branch 'master' into query-container
elshize Apr 27, 2020
cdc17f3
CLI test
elshize Apr 27, 2020
98fe8b1
Merge branch 'query-container' of github.com:pisa-engine/pisa into qu…
elshize Apr 27, 2020
b0e5d1a
Fix .travis.yml syntax
elshize Apr 27, 2020
6e2ab62
Fix .travis.yml syntax
elshize Apr 27, 2020
2cce2cd
Fix when cli test are executed
elshize Apr 27, 2020
982d316
Merge branch 'master' into query-container
elshize Apr 28, 2020
1838258
Refactor out common code from tool
elshize Apr 28, 2020
3ba4588
Merge branch 'master' into query-container
elshize Apr 29, 2020
7107f65
Small refactoring and term resolver tests
elshize May 1, 2020
ede9c98
Fix tool description
elshize May 1, 2020
b8f625c
Multiple thresholds per query
elshize May 3, 2020
78cf15c
Return program with 1 if fails
elshize May 3, 2020
d4e63cf
Merge branch 'master' into query-container
elshize May 19, 2020
ebf1acb
Merge branch 'master' into query-container
elshize May 21, 2020
83f9c74
Fix merging issue
elshize May 22, 2020
4b0b05c
Merge branch 'master' into query-container
elshize Jun 1, 2020
a22f794
Merge branch 'master' into query-container
elshize Jun 2, 2020
e0b052a
Merge branch 'master' into query-container
elshize Jun 2, 2020
6382a38
Merge branch 'master' into query-container
elshize Jun 4, 2020
ca09597
Merge branch 'master' into query-container
elshize Jun 5, 2020
f173ea3
Merge branch 'master' into query-container
elshize Jun 5, 2020
104d310
Merge branch 'master' into query-container
elshize Jun 15, 2020
af1871d
Merge branch 'master' into query-container
elshize Jun 18, 2020
8e6da84
Merge branch 'master' into query-container
elshize Jun 24, 2020
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -105,6 +105,7 @@ target_link_libraries(pisa PUBLIC # TODO(michal): are there any of these we can
spdlog
fmt::fmt
range-v3
nlohmann_json::nlohmann_json
)
target_include_directories(pisa PUBLIC external)

149 changes: 149 additions & 0 deletions include/pisa/query.hpp
@@ -0,0 +1,149 @@
#pragma once

#include <cstddef>
#include <cstdint>
#include <functional>
#include <istream>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

#include <gsl/span>

namespace pisa {

struct QueryContainerInner;

struct ResolvedTerm {
std::uint32_t id;
std::string term;
};

using TermProcessorFn = std::function<std::optional<std::string>(std::string)>;
using ParseFn = std::function<std::vector<ResolvedTerm>(std::string const&)>;

class QueryContainer;

/// QueryRequest is a special container that maintains important invariants, such as sorted term
/// IDs, and also has some additional data, like term weights, etc.
class QueryRequest {
public:
explicit QueryRequest(QueryContainer const& data, std::size_t k);

[[nodiscard]] auto term_ids() const -> gsl::span<std::uint32_t const>;
[[nodiscard]] auto threshold() const -> std::optional<float>;
[[nodiscard]] auto k() const -> std::size_t;

private:
std::size_t m_k;
std::optional<float> m_threshold{};
std::vector<std::uint32_t> m_term_ids{};
};

class QueryContainer {
public:
QueryContainer(QueryContainer const&);
QueryContainer(QueryContainer&&) noexcept;
QueryContainer& operator=(QueryContainer const&);
QueryContainer& operator=(QueryContainer&&) noexcept;
~QueryContainer();

[[nodiscard]] auto operator==(QueryContainer const& other) const noexcept -> bool;

/// Constructs a query from a raw string.
[[nodiscard]] static auto raw(std::string query_string) -> QueryContainer;

/// Constructs a query from a list of terms.
///
/// \param terms List of terms
/// \param term_processor Function executed for each term before storing them,
/// e.g., stemming or filtering. This function returns
/// `std::optional<std::string>`, and all `std::nullopt` values
/// will be filtered out.
[[nodiscard]] static auto
from_terms(std::vector<std::string> terms, std::optional<TermProcessorFn> term_processor)
-> QueryContainer;

/// Constructs a query from a list of term IDs.
[[nodiscard]] static auto from_term_ids(std::vector<std::uint32_t> term_ids) -> QueryContainer;

/// Constructs a query from a JSON object.
[[nodiscard]] static auto from_json(std::string_view json_string) -> QueryContainer;

[[nodiscard]] auto to_json() const -> std::string;

/// Constructs a query from a colon-separated format:
///
/// ```
/// id:raw query string
/// ```
/// or
/// ```
/// raw query string
/// ```
[[nodiscard]] static auto from_colon_format(std::string_view line) -> QueryContainer;

// Accessors

[[nodiscard]] auto id() const noexcept -> std::optional<std::string> const&;
[[nodiscard]] auto string() const noexcept -> std::optional<std::string> const&;
[[nodiscard]] auto terms() const noexcept -> std::optional<std::vector<std::string>> const&;
[[nodiscard]] auto term_ids() const noexcept -> std::optional<std::vector<std::uint32_t>> const&;
Member

Let's not use std::uint32_t unless we decide to prepend std:: to all the PODs.

Member Author

Fixed-width integers like uint32_t are part of the standard library and are located in the std namespace. The fact that some headers export them at the root level is not standard. These types are defined without namespaces (for obvious reasons) in the C standard. Compare the example in https://en.cppreference.com/w/cpp/types/integer with https://en.cppreference.com/w/c/types/integer. In either case, they are not part of the set of fundamental integer types: https://en.cppreference.com/w/cpp/language/types

This has nothing to do with being a POD. struct CustomStruct { int x; } is a POD, yet you would use it just the same as class Complex { /* magic heap stuff going on */ }.
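
The two points in the reply above can be checked at compile time. The following is a minimal sketch, assuming C++17; `CustomStruct` mirrors the struct named in the comment, while the body of `Complex` is a hypothetical stand-in invented here for illustration. Since `std::is_pod` is deprecated as of C++20, the sketch tests the two properties that together make up a POD (trivial and standard-layout).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>      // declares std::uint32_t in namespace std
#include <limits>
#include <type_traits>
#include <utility>
#include <vector>

// Mirrors the struct from the comment: trivial and standard-layout, i.e. a POD.
struct CustomStruct {
    int x;
};

// Hypothetical class that owns heap memory; the members are invented for this sketch.
class Complex {
  public:
    explicit Complex(std::vector<int> data) : m_data(std::move(data)) {}
    std::size_t size() const { return m_data.size(); }

  private:
    std::vector<int> m_data;
};

// std::uint32_t is an alias for some unsigned integer type that is exactly 32 bits wide.
static_assert(std::is_integral_v<std::uint32_t>);
static_assert(std::is_unsigned_v<std::uint32_t>);
static_assert(std::numeric_limits<std::uint32_t>::digits == 32);

// POD-ness is a property of the type, unrelated to how the type's name is qualified.
static_assert(std::is_trivial_v<CustomStruct> && std::is_standard_layout_v<CustomStruct>);
static_assert(!std::is_trivial_v<Complex>);

// Both kinds of type are used the same way at the call site.
inline void example()
{
    CustomStruct pod{42};
    Complex heap_backed(std::vector<int>{1, 2, 3});
    assert(pod.x == 42);
    assert(heap_backed.size() == 3);
}
```

Including `<cstdint>` guarantees the `std::`-qualified name; whether the unqualified `uint32_t` is also visible through that header is left to the implementation.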

[[nodiscard]] auto threshold(std::size_t k) const noexcept -> std::optional<float>;
[[nodiscard]] auto thresholds() const noexcept
-> std::vector<std::pair<std::size_t, float>> const&;

/// Sets the raw string.
[[nodiscard]] auto string(std::string) -> QueryContainer&;

/// Parses the raw query with the given parser.
///
/// \throws std::domain_error when raw string is not set
auto parse(ParseFn parse_fn) -> QueryContainer&;

/// Sets the query score threshold for `k`.
///
/// If another threshold for the same `k` exists, it will be replaced,
/// and `true` will be returned. Otherwise, `false` will be returned.
auto add_threshold(std::size_t k, float score) -> bool;

/// Returns a query ready to be used for retrieval.
[[nodiscard]] auto query(std::size_t k) const -> QueryRequest;

private:
QueryContainer();
std::unique_ptr<QueryContainerInner> m_data;
};
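
The replace-or-insert contract documented for `add_threshold` (return `true` when a threshold for the same `k` was replaced, `false` otherwise) can be sketched as follows. This is not the PR's implementation, which lives behind `QueryContainerInner` and is not shown in this diff; `Thresholds`, `add`, and `get` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-query threshold list.
class Thresholds {
  public:
    // Sets the threshold for `k`. Returns true if an existing entry for `k`
    // was replaced, false if a new entry was added.
    bool add(std::size_t k, float score)
    {
        auto pos = find(k);
        if (pos != m_entries.end()) {
            pos->second = score;
            return true;
        }
        m_entries.emplace_back(k, score);
        return false;
    }

    // Returns the threshold for `k`, or nullopt if none was set.
    std::optional<float> get(std::size_t k) const
    {
        auto pos = std::find_if(m_entries.begin(), m_entries.end(), [k](auto const& entry) {
            return entry.first == k;
        });
        if (pos == m_entries.end()) {
            return std::nullopt;
        }
        return pos->second;
    }

  private:
    std::vector<std::pair<std::size_t, float>>::iterator find(std::size_t k)
    {
        return std::find_if(m_entries.begin(), m_entries.end(), [k](auto const& entry) {
            return entry.first == k;
        });
    }

    std::vector<std::pair<std::size_t, float>> m_entries;
};
```

A linear scan is assumed to be adequate here because a query is expected to carry only a handful of `(k, threshold)` pairs.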

enum class Format { Json, Colon };

class QueryReader {
public:
/// Open reader from file.
static auto from_file(std::string const& file) -> QueryReader;
/// Open reader from stdin.
static auto from_stdin() -> QueryReader;

/// Read next query or return `nullopt` if stream has ended.
[[nodiscard]] auto next() -> std::optional<QueryContainer>;

/// Execute `fn(q)` for each query `q`.
template <typename Fn>
void for_each(Fn&& fn)
{
auto query = next();
while (query) {
fn(std::move(*query));
query = next();
}
}

private:
explicit QueryReader(std::unique_ptr<std::istream> stream, std::istream& stream_ref);

std::unique_ptr<std::istream> m_stream;
std::istream& m_stream_ref;
std::string m_line_buf{};
std::optional<Format> m_format{};
};

} // namespace pisa
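
The colon-separated format documented for `from_colon_format` (`id:raw query string`, or just `raw query string`) could be split as in the sketch below. It assumes that everything before the first `:` is the ID and that a line without `:` has no ID; the PR's actual handling of edge cases is not visible in this header. `ColonQuery` and `parse_colon_line` are hypothetical names, not part of the PR.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for the (id, raw string) pair stored in a query.
struct ColonQuery {
    std::optional<std::string> id;
    std::string raw;
};

// Assumption: the first ':' separates the ID from the raw query string.
inline ColonQuery parse_colon_line(std::string_view line)
{
    auto const colon = line.find(':');
    if (colon == std::string_view::npos) {
        // No separator: the whole line is the raw query and there is no ID.
        return ColonQuery{std::nullopt, std::string(line)};
    }
    return ColonQuery{std::string(line.substr(0, colon)), std::string(line.substr(colon + 1))};
}

inline void example()
{
    auto with_id = parse_colon_line("q1:hello world");
    assert(with_id.id.has_value() && *with_id.id == "q1");
    assert(with_id.raw == "hello world");

    auto without_id = parse_colon_line("hello world");
    assert(!without_id.id.has_value());
    assert(without_id.raw == "hello world");
}
```

Under this assumption, a query without an ID whose text itself contains a colon would be misread as having one, so a real parser may need a stricter rule.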
26 changes: 26 additions & 0 deletions include/pisa/query/query_parser.hpp
@@ -0,0 +1,26 @@
#pragma once

#include <string>
#include <vector>

#include "query.hpp"
#include "term_resolver.hpp"

namespace pisa {

/// Parses a query string to processed terms.
class QueryParser {
public:
explicit QueryParser(TermResolver term_resolver);
/// Given a query string, returns a list of (possibly processed) terms.
///
/// Possible transformations of terms include lower-casing and stemming.
/// Some terms may also be removed, e.g., because they are on a list of
/// stop words. The exact behavior depends on the term resolver
/// passed to the constructor.
auto operator()(std::string const&) -> std::vector<ResolvedTerm>;

private:
TermResolver m_term_resolver;
};

} // namespace pisa
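
To illustrate how a `TermResolver` plugs into a parser with the interface above, here is a minimal sketch. It assumes whitespace tokenization and an in-memory lexicon; the tokenizer the PR actually uses is not shown in this header. `ResolvedTerm` and `TermResolver` are redeclared locally so the example is self-contained, and `make_map_resolver` and `parse_whitespace` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Local copies of the types declared in the PR's headers.
struct ResolvedTerm {
    std::uint32_t id;
    std::string term;
};
using TermResolver = std::function<std::optional<ResolvedTerm>(std::string)>;

// Hypothetical resolver backed by an in-memory lexicon; unknown terms resolve to nullopt.
inline TermResolver make_map_resolver(std::unordered_map<std::string, std::uint32_t> lexicon)
{
    return [lexicon = std::move(lexicon)](std::string token) -> std::optional<ResolvedTerm> {
        auto pos = lexicon.find(token);
        if (pos == lexicon.end()) {
            return std::nullopt;
        }
        return ResolvedTerm{pos->second, std::move(token)};
    };
}

// Assumption: tokens are separated by whitespace. Unresolved tokens are dropped.
inline std::vector<ResolvedTerm>
parse_whitespace(std::string const& query, TermResolver const& resolve)
{
    std::vector<ResolvedTerm> terms;
    std::istringstream tokens(query);
    std::string token;
    while (tokens >> token) {
        if (auto resolved = resolve(token)) {
            terms.push_back(std::move(*resolved));
        }
    }
    return terms;
}

inline void example()
{
    std::unordered_map<std::string, std::uint32_t> lexicon{{"hello", 0u}, {"world", 7u}};
    auto terms = parse_whitespace("hello unknown world", make_map_resolver(lexicon));
    assert(terms.size() == 2);
    assert(terms[0].id == 0 && terms[0].term == "hello");
    assert(terms[1].id == 7 && terms[1].term == "world");
}
```

Because the resolver returns `std::optional`, stop-word removal and out-of-vocabulary handling both reduce to returning `std::nullopt`, and the parser needs no special cases.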
53 changes: 53 additions & 0 deletions include/pisa/query/term_resolver.hpp
@@ -0,0 +1,53 @@
#pragma once

#include <cstddef>
#include <cstdint>
#include <functional>
#include <iosfwd>
#include <memory>
#include <optional>
#include <string>

#include "query.hpp"

namespace pisa {

/// Thrown when a term resolver is required but none was provided.
struct MissingResolverError {
};

using TermResolver = std::function<std::optional<ResolvedTerm>(std::string)>;

struct StandardTermResolverParams;

/// Provides a standard implementation of `TermResolver`.
class StandardTermResolver {
public:
StandardTermResolver(
std::string const& term_lexicon_path,
std::optional<std::string> const& stopwords_filename,
std::optional<std::string> const& stemmer_type);
StandardTermResolver(StandardTermResolver const&);
StandardTermResolver(StandardTermResolver&&) noexcept;
StandardTermResolver& operator=(StandardTermResolver const&);
StandardTermResolver& operator=(StandardTermResolver&&) noexcept;
~StandardTermResolver();

[[nodiscard]] auto operator()(std::string token) const -> std::optional<ResolvedTerm>;

private:
[[nodiscard]] auto is_stopword(std::uint32_t const term) const -> bool;

std::unique_ptr<StandardTermResolverParams> m_self;
};

/// Reads queries from `query_file`, resolves them with `term_resolver`, filters by
/// query length (number of resolved terms in the query), and prints the selected
/// queries to `out`.
///
/// \throws MissingResolverError when no resolver is passed but the queries do not have
///         their term IDs resolved.
void filter_queries(
std::optional<std::string> const& query_file,
std::optional<TermResolver> term_resolver,
std::size_t min_query_len,
std::size_t max_query_len,
std::ostream& out);

} // namespace pisa
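
The length filter described for `filter_queries` can be sketched as below. The sketch assumes that both bounds are inclusive, which this header does not state, and it skips reading and resolving queries altogether. `ResolvedQuery` and `filter_by_length` are hypothetical names, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in: the text to print for a query and its number of resolved terms.
struct ResolvedQuery {
    std::string line;
    std::size_t num_terms;
};

// Prints the queries whose length falls in [min_len, max_len] (bounds assumed inclusive).
inline void filter_by_length(
    std::vector<ResolvedQuery> const& queries,
    std::size_t min_len,
    std::size_t max_len,
    std::ostream& out)
{
    for (auto const& query : queries) {
        if (query.num_terms >= min_len && query.num_terms <= max_len) {
            out << query.line << '\n';
        }
    }
}

inline void example()
{
    std::vector<ResolvedQuery> queries{{"a", 1}, {"a b", 2}, {"a b c d", 4}};
    std::ostringstream out;
    filter_by_length(queries, 2, 3, out);
    assert(out.str() == "a b\n");
}
```

Writing to a `std::ostream&` rather than directly to standard output matches the signature in the header and keeps the function easy to test with a string stream.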