Query container #373
Open
elshize wants to merge 27 commits into main from query-container
27 commits
12d5fb6 Query contaier (elshize)
1d3aa88 Query container parsing (elshize)
ce2b720 Merge branch 'master' into query-container (elshize)
cdc17f3 CLI test (elshize)
98fe8b1 Merge branch 'query-container' of github.com:pisa-engine/pisa into qu… (elshize)
b0e5d1a Fix .travis.yml syntax (elshize)
6e2ab62 Fix .travis.yml syntax (elshize)
2cce2cd Fix when cli test are executed (elshize)
982d316 Merge branch 'master' into query-container (elshize)
1838258 Refactor out common code from tool (elshize)
3ba4588 Merge branch 'master' into query-container (elshize)
7107f65 Small refactoring and term resolver tests (elshize)
ede9c98 Fix tool description (elshize)
b8f625c Multiple thresholds per query (elshize)
78cf15c Return program with 1 if fails (elshize)
d4e63cf Merge branch 'master' into query-container (elshize)
ebf1acb Merge branch 'master' into query-container (elshize)
83f9c74 Fix merging issue (elshize)
4b0b05c Merge branch 'master' into query-container (elshize)
a22f794 Merge branch 'master' into query-container (elshize)
e0b052a Merge branch 'master' into query-container (elshize)
6382a38 Merge branch 'master' into query-container (elshize)
ca09597 Merge branch 'master' into query-container (elshize)
f173ea3 Merge branch 'master' into query-container (elshize)
104d310 Merge branch 'master' into query-container (elshize)
af1871d Merge branch 'master' into query-container (elshize)
8e6da84 Merge branch 'master' into query-container (elshize)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,149 @@ | ||
#pragma once | ||
|
||
#include <functional> | ||
#include <istream> | ||
#include <memory> | ||
#include <optional> | ||
#include <string> | ||
#include <vector> | ||
|
||
#include <gsl/span> | ||
|
||
namespace pisa { | ||
|
||
struct QueryContainerInner; | ||
|
||
struct ResolvedTerm { | ||
std::uint32_t id; | ||
std::string term; | ||
}; | ||
|
||
using TermProcessorFn = std::function<std::optional<std::string>(std::string)>; | ||
using ParseFn = std::function<std::vector<ResolvedTerm>(std::string const&)>; | ||
|
||
class QueryContainer; | ||
|
||
/// QueryRequest is a special container that maintains important invariants, such as sorted term | ||
/// IDs, and also has some additional data, like term weights, etc. | ||
class QueryRequest { | ||
public: | ||
explicit QueryRequest(QueryContainer const& data, std::size_t k); | ||
|
||
[[nodiscard]] auto term_ids() const -> gsl::span<std::uint32_t const>; | ||
[[nodiscard]] auto threshold() const -> std::optional<float>; | ||
[[nodiscard]] auto k() const -> std::size_t; | ||
|
||
private: | ||
std::size_t m_k; | ||
std::optional<float> m_threshold{}; | ||
std::vector<std::uint32_t> m_term_ids{}; | ||
}; | ||
|
||
class QueryContainer { | ||
public: | ||
QueryContainer(QueryContainer const&); | ||
QueryContainer(QueryContainer&&) noexcept; | ||
QueryContainer& operator=(QueryContainer const&); | ||
QueryContainer& operator=(QueryContainer&&) noexcept; | ||
~QueryContainer(); | ||
|
||
[[nodiscard]] auto operator==(QueryContainer const& other) const noexcept -> bool; | ||
|
||
/// Constructs a query from a raw string. | ||
[[nodiscard]] static auto raw(std::string query_string) -> QueryContainer; | ||
|
||
/// Constructs a query from a list of terms. | ||
/// | ||
/// \param terms List of terms | ||
/// \param term_processor Function executed for each term before storing them, | ||
/// e.g., stemming or filtering. This function returns | ||
/// `std::optional<std::string>`, and all `std::nullopt` values | ||
/// will be filtered out. | ||
[[nodiscard]] static auto | ||
from_terms(std::vector<std::string> terms, std::optional<TermProcessorFn> term_processor) | ||
-> QueryContainer; | ||
|
||
/// Constructs a query from a list of term IDs. | ||
[[nodiscard]] static auto from_term_ids(std::vector<std::uint32_t> term_ids) -> QueryContainer; | ||
|
||
/// Constructs a query from a JSON object. | ||
[[nodiscard]] static auto from_json(std::string_view json_string) -> QueryContainer; | ||
|
||
[[nodiscard]] auto to_json() const -> std::string; | ||
|
||
/// Constructs a query from a colon-separated format: | ||
/// | ||
/// ``` | ||
/// id:raw query string | ||
/// ``` | ||
/// or | ||
/// ``` | ||
/// raw query string | ||
/// ``` | ||
[[nodiscard]] static auto from_colon_format(std::string_view line) -> QueryContainer; | ||
|
||
// Accessors | ||
|
||
[[nodiscard]] auto id() const noexcept -> std::optional<std::string> const&; | ||
[[nodiscard]] auto string() const noexcept -> std::optional<std::string> const&; | ||
[[nodiscard]] auto terms() const noexcept -> std::optional<std::vector<std::string>> const&; | ||
[[nodiscard]] auto term_ids() const noexcept -> std::optional<std::vector<std::uint32_t>> const&; | ||
[[nodiscard]] auto threshold(std::size_t k) const noexcept -> std::optional<float>; | ||
[[nodiscard]] auto thresholds() const noexcept | ||
-> std::vector<std::pair<std::size_t, float>> const&; | ||
|
||
/// Sets the raw string. | ||
[[nodiscard]] auto string(std::string) -> QueryContainer&; | ||
|
||
/// Parses the raw query with the given parser. | ||
/// | ||
/// \throws std::domain_error when raw string is not set | ||
auto parse(ParseFn parse_fn) -> QueryContainer&; | ||
|
||
/// Sets the query score threshold for `k`. | ||
/// | ||
/// If another threshold for the same `k` exists, it will be replaced, | ||
/// and `true` will be returned. Otherwise, `false` will be returned. | ||
auto add_threshold(std::size_t k, float score) -> bool; | ||
|
||
/// Returns a query ready to be used for retrieval. | ||
[[nodiscard]] auto query(std::size_t k) const -> QueryRequest; | ||
|
||
private: | ||
QueryContainer(); | ||
std::unique_ptr<QueryContainerInner> m_data; | ||
}; | ||
|
||
enum class Format { Json, Colon }; | ||
|
||
class QueryReader { | ||
public: | ||
/// Open reader from file. | ||
static auto from_file(std::string const& file) -> QueryReader; | ||
/// Open reader from stdin. | ||
static auto from_stdin() -> QueryReader; | ||
|
||
/// Read next query or return `nullopt` if stream has ended. | ||
[[nodiscard]] auto next() -> std::optional<QueryContainer>; | ||
|
||
/// Execute `fn(q)` for each query `q`. | ||
template <typename Fn> | ||
void for_each(Fn&& fn) | ||
{ | ||
auto query = next(); | ||
while (query) { | ||
fn(std::move(*query)); | ||
query = next(); | ||
} | ||
} | ||
|
||
private: | ||
explicit QueryReader(std::unique_ptr<std::istream> stream, std::istream& stream_ref); | ||
|
||
std::unique_ptr<std::istream> m_stream; | ||
std::istream& m_stream_ref; | ||
std::string m_line_buf{}; | ||
std::optional<Format> m_format{}; | ||
}; | ||
|
||
} // namespace pisa |
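Two behaviors documented in the header above are easy to misread, so here is a minimal, standalone sketch of both: the colon-separated line format and the replace-or-insert contract of `add_threshold`. It does not use the PR's `QueryContainer`; `ColonQuery`, `parse_colon_line`, `Thresholds`, and the free function `add_threshold` are hypothetical names for illustration, and splitting at the first colon is an assumption about how a line with several colons would be treated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed line; not the PR's QueryContainer.
struct ColonQuery {
    std::optional<std::string> id;
    std::string raw;
};

// Assumption: the line is split at the FIRST colon; a line without a colon has no ID.
inline auto parse_colon_line(std::string_view line) -> ColonQuery
{
    auto pos = line.find(':');
    if (pos == std::string_view::npos) {
        return ColonQuery{std::nullopt, std::string(line)};
    }
    return ColonQuery{std::string(line.substr(0, pos)), std::string(line.substr(pos + 1))};
}

using Thresholds = std::vector<std::pair<std::size_t, float>>;

// Follows the documented contract: returns true if a threshold for `k` was replaced,
// false if a new entry was added.
inline auto add_threshold(Thresholds& thresholds, std::size_t k, float score) -> bool
{
    auto pos = std::find_if(thresholds.begin(), thresholds.end(), [k](auto const& entry) {
        return entry.first == k;
    });
    if (pos != thresholds.end()) {
        pos->second = score;
        return true;
    }
    thresholds.emplace_back(k, score);
    return false;
}

inline void usage_example()
{
    auto with_id = parse_colon_line("q1:hello world");
    assert(with_id.id.has_value() && *with_id.id == "q1");
    assert(with_id.raw == "hello world");

    auto without_id = parse_colon_line("hello world");
    assert(!without_id.id.has_value());

    Thresholds thresholds;
    assert(!add_threshold(thresholds, 10, 1.5F));  // first value for k = 10: inserted
    assert(add_threshold(thresholds, 10, 2.5F));   // second value for k = 10: replaced
    assert(thresholds.size() == 1);
}
```

Under the first-colon assumption, a raw query that itself contains a colon must carry an explicit ID, otherwise its prefix is taken for one.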
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
#pragma once | ||
|
||
#include <string> | ||
|
||
#include "query.hpp" | ||
#include "term_resolver.hpp" | ||
|
||
namespace pisa { | ||
|
||
/// Parses a query string to processed terms. | ||
class QueryParser { | ||
public: | ||
explicit QueryParser(TermResolver term_resolver); | ||
/// Given a query string, returns a list of (possibly processed) terms. | ||
/// | ||
/// Possible transformations of terms include lower-casing and stemming. | ||
/// Some terms could also be removed, e.g., because they are on a list of | ||
/// stop words. The exact implementation depends on the term resolver | ||
/// passed to the constructor. | ||
auto operator()(std::string const&) -> std::vector<ResolvedTerm>; | ||
|
||
private: | ||
TermResolver m_term_resolver; | ||
}; | ||
|
||
} // namespace pisa |
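To illustrate how a parser of this shape could combine tokenization with a `TermResolver`, here is a rough sketch. `ResolvedTerm` and `TermResolver` are local copies of the declarations in this PR so that the snippet compiles on its own; `parse_query` is a hypothetical free function, and splitting on whitespace is an assumption, since the PR's actual tokenizer is not shown in this header.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Local copies of the types declared in the PR, so the sketch is self-contained.
struct ResolvedTerm {
    std::uint32_t id;
    std::string term;
};
using TermResolver = std::function<std::optional<ResolvedTerm>(std::string)>;

// Assumption: tokens are separated by whitespace; the real parser may tokenize differently.
inline auto parse_query(std::string const& query, TermResolver const& resolve)
    -> std::vector<ResolvedTerm>
{
    std::vector<ResolvedTerm> terms;
    std::istringstream tokens(query);
    std::string token;
    while (tokens >> token) {
        // Tokens for which the resolver returns std::nullopt are dropped.
        if (auto resolved = resolve(token)) {
            terms.push_back(std::move(*resolved));
        }
    }
    return terms;
}

inline void usage_example()
{
    // Toy resolver: drops "the" and uses the token length as a fake term ID.
    TermResolver toy = [](std::string token) -> std::optional<ResolvedTerm> {
        if (token == "the") {
            return std::nullopt;
        }
        auto id = static_cast<std::uint32_t>(token.size());
        return ResolvedTerm{id, std::move(token)};
    };
    auto terms = parse_query("the quick fox", toy);
    assert(terms.size() == 2);
    assert(terms[0].term == "quick" && terms[0].id == 5);
    assert(terms[1].term == "fox" && terms[1].id == 3);
}
```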
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
#pragma once | ||
|
||
#include <functional> | ||
#include <optional> | ||
#include <string> | ||
|
||
#include "query.hpp" | ||
|
||
namespace pisa { | ||
|
||
/// Thrown when a term resolver is required but none was provided. | ||
struct MissingResolverError { | ||
}; | ||
|
||
using TermResolver = std::function<std::optional<ResolvedTerm>(std::string)>; | ||
|
||
struct StandardTermResolverParams; | ||
|
||
/// Provides a standard implementation of `TermResolver`. | ||
class StandardTermResolver { | ||
public: | ||
StandardTermResolver( | ||
std::string const& term_lexicon_path, | ||
std::optional<std::string> const& stopwords_filename, | ||
std::optional<std::string> const& stemmer_type); | ||
StandardTermResolver(StandardTermResolver const&); | ||
StandardTermResolver(StandardTermResolver&&) noexcept; | ||
StandardTermResolver& operator=(StandardTermResolver const&); | ||
StandardTermResolver& operator=(StandardTermResolver&&) noexcept; | ||
~StandardTermResolver(); | ||
|
||
[[nodiscard]] auto operator()(std::string token) const -> std::optional<ResolvedTerm>; | ||
|
||
private: | ||
[[nodiscard]] auto is_stopword(std::uint32_t const term) const -> bool; | ||
|
||
std::unique_ptr<StandardTermResolverParams> m_self; | ||
}; | ||
|
||
/// Reads queries from `query_file`, resolves them with `term_resolver`, filters by | ||
/// query length (number of resolved terms in the query), and prints the selected | ||
/// queries to `out`. | ||
/// | ||
/// \throws MissingResolverError When no resolver passed but queries don't have IDs resolved. | ||
// | ||
void filter_queries( | ||
std::optional<std::string> const& query_file, | ||
std::optional<TermResolver> term_resolver, | ||
std::size_t min_query_len, | ||
std::size_t max_query_len, | ||
std::ostream& out); | ||
|
||
} // namespace pisa |
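For readers who want a feel for what a callable satisfying `TermResolver` might do, below is a toy, in-memory sketch. It is not `StandardTermResolver`: `ToyTermResolver` is a hypothetical class, the lexicon is a plain hash map rather than a file on disk, stemming is omitted, and lower-casing before lookup is an assumption. Stop words are checked by term ID, which mirrors the signature of the private `is_stopword` above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Local copy of the type declared in the PR, so the sketch is self-contained.
struct ResolvedTerm {
    std::uint32_t id;
    std::string term;
};

// Hypothetical in-memory resolver; the PR's resolver reads its lexicon from disk.
class ToyTermResolver {
  public:
    ToyTermResolver(
        std::unordered_map<std::string, std::uint32_t> lexicon,
        std::unordered_set<std::uint32_t> stopword_ids)
        : m_lexicon(std::move(lexicon)), m_stopword_ids(std::move(stopword_ids))
    {}

    [[nodiscard]] auto operator()(std::string token) const -> std::optional<ResolvedTerm>
    {
        // Assumption: terms are lower-cased before the lexicon lookup.
        std::transform(token.begin(), token.end(), token.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        auto pos = m_lexicon.find(token);
        if (pos == m_lexicon.end() || m_stopword_ids.count(pos->second) > 0) {
            return std::nullopt;  // unknown term or stop word
        }
        return ResolvedTerm{pos->second, std::move(token)};
    }

  private:
    std::unordered_map<std::string, std::uint32_t> m_lexicon;
    std::unordered_set<std::uint32_t> m_stopword_ids;
};

inline void usage_example()
{
    std::unordered_map<std::string, std::uint32_t> lexicon{{"the", 0U}, {"fox", 1U}};
    std::unordered_set<std::uint32_t> stopword_ids{0U};
    ToyTermResolver resolve(std::move(lexicon), std::move(stopword_ids));

    auto fox = resolve("Fox");
    assert(fox.has_value() && fox->id == 1U && fox->term == "fox");
    assert(!resolve("the").has_value());      // stop word
    assert(!resolve("unknown").has_value());  // not in the lexicon
}
```

Because the class exposes `operator()` with a matching signature, an instance can be stored in a `std::function<std::optional<ResolvedTerm>(std::string)>`.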
Let's not use `std::uint32_t` unless we decide to prepend `std` to all the PODs.
Fixed-width integers like `uint32_t` are part of the standard library and are located in the `std` namespace. The fact that some headers export them at the root level is not standard. These types are defined without namespaces (for obvious reasons) in the C standard. Compare the example in https://en.cppreference.com/w/cpp/types/integer with https://en.cppreference.com/w/c/types/integer

In either case, they are not part of the set of fundamental integer types: https://en.cppreference.com/w/cpp/language/types

This has nothing to do with being a POD. `struct CustomStruct { int x; }` is a POD, yet you would use it just the same as `class Complex { /* magic heap stuff going on */ }`.
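A small sketch of the point made in this reply, assuming C++17: `std::uint32_t` comes from `<cstdint>`, and whether a type is a POD (trivial and standard-layout) is a separate property that does not change how the type is spelled or used. `CustomStruct` and `Complex` follow the names in the comment; their members here are made up for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>

struct CustomStruct {
    int x;
};

class Complex {
  public:
    explicit Complex(std::string text) : m_text(std::move(text)) {}
    [[nodiscard]] auto size() const -> std::size_t { return m_text.size(); }

  private:
    std::string m_text;  // owns heap memory, so the class is not trivial
};

// <cstdint> declares the fixed-width aliases in namespace std.
static_assert(sizeof(std::uint32_t) * CHAR_BIT == 32, "exactly 32 bits wide");
static_assert(std::is_unsigned_v<std::uint32_t>, "an unsigned integer type");

// "POD" means trivial and standard-layout; this holds for CustomStruct only.
static_assert(std::is_trivial_v<CustomStruct> && std::is_standard_layout_v<CustomStruct>);
static_assert(!std::is_trivial_v<Complex>);

inline void usage_example()
{
    // All three are declared and used the same way, POD or not.
    std::uint32_t term_id = 7;
    CustomStruct plain{3};
    Complex heavy("query");
    assert(term_id + static_cast<std::uint32_t>(plain.x) == 10U);
    assert(heavy.size() == 5U);
}
```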