build_faiss command #601
base: dev
Hello @nicklein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2022-05-19 22:08:31 UTC
Hi @nicklein, I added some comments. Please make the changes.
I'll take a closer look at the code logic after you make the required changes.
kgtk/cli/build_faiss.py
Outdated
# REQUIRED #
# Related to input file
parser.add_argument('-i', '--input_file', '--embeddings_file', action='store', dest='embeddings_file',
There is an add_input_file function; please use it:
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L45
done
kgtk/cli/build_faiss.py
Outdated
                    help='Input file containing the embeddings for which a Faiss index will be created.')

# Related to output
parser.add_argument('-o', '--output_file', '--index_file_out', action='store', dest='index_file_out',
There is also an add_output_file() function:
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L50
done
kgtk/cli/build_faiss.py
Outdated
                    required=True, help="Output .idx file where the index fill be saved.",
                    metavar="INDEX_FILE_OUT")

parser.add_argument('-id2n', '--index_to_node_file_out', action='store', dest='index_to_node_file_out',
All the secondary options should use hyphens (-) instead of underscores: --index_to_node_file_out should be --index-to-node-file-out.
Also, as a general rule, do not use stop words like "to" in the parameter name. This review comment applies to all the subsequent parameters.
The dest parameter is fine.
Is the reason to avoid stop words simply for brevity? I am struggling to think of a name for this that omits 'to' and isn't ambiguous. This parameter specifies the path where a kgtk file will be saved. The kgtk file contains a mapping of index to corresponding node. Do you have a suggestion for this parameter name?
Can you explain a bit more about what exactly this file contains? A mapping of what index? How is this different from the output file?
When you use a faiss index to search for nearest neighbors, it returns distances and corresponding indexes/IDs of the nearest neighbors, rather than the names or embeddings of the nearest neighbors. This file would allow you to look up the entity name that corresponds to the index/ID. Here's an example of what the file would look like:
Input embedding file:
q1 <embedding>
q2 <embedding>
q3 <embedding>
output index_to_node file:
node1 label node2
0 index_to_node q1
1 index_to_node q2
2 index_to_node q3
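A minimal sketch of how such a mapping file could be produced, assuming the vectors are added to the index in file order so that the ids returned by a search are simply row positions starting at 0. The helper name `write_index_to_node` and the sample embedding values are made up for illustration; this is not the code in the PR.

```python
import io

def write_index_to_node(embedding_lines, out):
    """Write one KGTK-style edge per embedding row: row position -> node name."""
    out.write("node1\tlabel\tnode2\n")  # header, matching the example above
    for idx, line in enumerate(embedding_lines):
        # Assumption: the node name is the first whitespace-separated field.
        node = line.split(maxsplit=1)[0]
        out.write(f"{idx}\tindex_to_node\t{node}\n")

# Illustrative input only; real embedding rows would be much longer.
embeddings = ["q1 0.1,0.2", "q2 0.3,0.4", "q3 0.5,0.6"]
buf = io.StringIO()
write_index_to_node(embeddings, buf)
rows = buf.getvalue().splitlines()
```

Looking up row `i + 1` of the output (after the header) then gives the node for search result id `i`.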
I suppose I could use the word ID rather than index here to avoid confusion. Then I could call this 'node_id_file_out' to avoid 'to'.
I see, using the stop word "to" makes sense here. I would call this --faiss-id-to-node-mapping-file.
As you can see, brevity is not the issue; a meaningful name is.
done
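For reference, a sketch of the naming convention discussed in this thread, written with plain `argparse` rather than the project's own parser class: the long option uses hyphens, while `dest` keeps the underscore name the code already uses. The metavar and help text are placeholders, not taken from the PR.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="build-faiss-index")
    parser.add_argument(
        "--faiss-id-to-node-mapping-file",   # hyphenated long option
        dest="index_to_node_file_out",       # dest may keep underscores
        required=True,
        metavar="FAISS_ID_TO_NODE_MAPPING_FILE",  # placeholder metavar
        help="Output KGTK file mapping each Faiss id to its node.",
    )
    return parser

# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--faiss-id-to-node-mapping-file", "id2node.tsv"])
```

Because `dest` is set explicitly, renaming the option does not require touching the code that reads `args.index_to_node_file_out`.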
kgtk/cli/build_faiss.py
Outdated
except SystemExit as e:
    raise KGTKException("Exit requested")
except KGTKException as e:
    raise
Raise the caught exception.
done
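One possible reading of this request, sketched under assumptions: exceptions that are already a `KGTKException` propagate unchanged, and anything else is wrapped with the original kept as the cause. The `KGTKException` class below is a local stand-in so the sketch runs on its own, and `run` and `boom` are hypothetical names; the change actually made in the PR may differ.

```python
class KGTKException(Exception):
    """Local stand-in for the project's KGTKException."""

def run(task):
    try:
        return task()
    except SystemExit:
        raise KGTKException("Exit requested")
    except KGTKException:
        raise  # already the right type: propagate unchanged
    except Exception as e:
        # Wrap anything else, keeping the original exception as __cause__.
        raise KGTKException(str(e)) from e

def boom():
    raise ValueError("bad dimension")

caught = None
try:
    run(boom)
except KGTKException as err:
    caught = err
```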
kgtk/cli/build_faiss.py
Outdated
"""
Train and populate a faiss index that can compute nearest neighbors of given embeddings.
"""
Rename the command and the file in cli to build-faiss-index.
done
kgtk/graph_embeddings/build_faiss.py
Outdated
def build_faiss(embeddings_file, embeddings_format, no_input_header, index_file_out, index_to_node_file_out,
                max_train_examples, workers, index_string, metric_type, p=None, verbose=False):

    # validate input file path
Use KGTKReader to read the input file, as it is an edge file. It handles many exceptions and special cases. Take a look at the cat command for an example.
kgtk/graph_embeddings/build_faiss.py
Outdated
# validate metric type and translate to a faiss metric
metrics = {
    "Inner_product": faiss.METRIC_INNER_PRODUCT,
Make sure this is case-insensitive.
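A sketch of one way to make the lookup case-insensitive: store the keys in lowercase and normalize the user's value before looking it up. The string values are placeholders standing in for the `faiss.METRIC_*` constants so the snippet runs without faiss installed. `resolve_metric` is a hypothetical helper, and it raises `ValueError` here where the PR raises `KGTKException`.

```python
# Placeholder values stand in for the faiss.METRIC_* constants.
METRICS = {
    "inner_product": "METRIC_INNER_PRODUCT",
    "lp": "METRIC_Lp",
}

def resolve_metric(metric_type):
    """Return the metric for metric_type, ignoring case and surrounding spaces."""
    key = metric_type.strip().lower()
    if key not in METRICS:
        raise ValueError(
            "Unrecognized value for metric_type parameter: {}. ".format(metric_type)
            + "Please choose one of {}.".format(" | ".join(METRICS))
        )
    return METRICS[key]

same = resolve_metric("INNER_PRODUCT") == resolve_metric("Inner_product")
```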
kgtk/graph_embeddings/build_faiss.py
Outdated
else:
    raise KGTKException("Unrecognized value for metric_type parameter: {}.".format(metric_type) +
                        "Please choose one of {}.".format(" | ".join(list(metrics.keys()))))
if metric_type == "Lp" and p is None:
What is p? Please use a descriptive name.
If the Lp distance metric is specified, then the user also needs to specify the value of p that they want to use, e.g. L1, L2, L...
I'll change it to metric_arg though since that is what Faiss calls it.
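To illustrate what p controls, here is the textbook Lp (Minkowski) distance in plain Python. This is only a sketch of the definition with a hypothetical `lp_distance` helper; it is not Faiss's implementation, which may scale the result differently (for example by skipping the final root).

```python
def lp_distance(x, y, p):
    """Textbook Minkowski distance: (sum |x_i - y_i|^p)^(1/p)."""
    if p is None or p <= 0:
        raise ValueError("p must be a positive number for the Lp metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

d1 = lp_distance([0, 0], [3, 4], 1)  # L1 (Manhattan): 3 + 4
d2 = lp_distance([0, 0], [3, 4], 2)  # L2 (Euclidean): sqrt(9 + 16)
```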
kgtk/graph_embeddings/build_faiss.py
Outdated
print("Writing index-to-node file...")
with open(embeddings_file, 'r') as f_in:
    with open(index_to_node_file_out, 'w+') as f_out:
        f_out.write("node1\tlabel\tnode2\n") # header
Use KGTKWriter instead.
kgtk/graph_embeddings/build_faiss.py
Outdated
# Load training examples for index training
train_vecs = []
if verbose:
    print("Loading training vectors...")
Print to an error_file; see the example in cat. No printing to standard out.
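A sketch of the pattern being asked for, assuming the function receives an `error_file` stream that defaults to `sys.stderr`. The function name, the comma-separated input format, and the sample values are illustrative and not taken from the PR.

```python
import io
import sys

def load_training_vectors(lines, verbose=False, error_file=sys.stderr):
    """Parse comma-separated vectors; progress messages go to error_file."""
    if verbose:
        # Status output goes to error_file, so standard out stays clean
        # for data that may be piped into another command.
        print("Loading training vectors...", file=error_file, flush=True)
    return [[float(v) for v in line.split(",")] for line in lines]

# Capture the messages in memory to show that nothing reaches standard out.
log = io.StringIO()
vecs = load_training_vectors(["0.1,0.2", "0.3,0.4"], verbose=True, error_file=log)
```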
@nicklein ETA on the requested changes?
New kgtk command for building a faiss index. Intended for use with graph embeddings.