Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build_faiss command #601

Open
wants to merge 16 commits into
base: dev
Choose a base branch
from
Open

Conversation

nicklein
Copy link
Collaborator

New kgtk command for building a faiss index. Intended for use with graph embeddings.

@pep8speaks
Copy link

pep8speaks commented Jan 13, 2022

Hello @nicklein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 38:121: E501 line too long (121 > 120 characters)

Comment last updated at 2022-05-19 22:08:31 UTC

Copy link
Member

@saggu saggu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nicklein I added some comments. Please make the changes.

I'll take another closer look at the code logic wise after you make the required changes


# REQUIRED #
# Related to input file
parser.add_argument('-i', '--input_file', '--embeddings_file', action='store', dest='embeddings_file',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a add_input_file function, please use it
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L45

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

help='Input file containing the embeddings for which a Faiss index will be created.')

# Related to output
parser.add_argument('-o', '--output_file', '--index_file_out', action='store', dest='index_file_out',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

required=True, help="Output .idx file where the index fill be saved.",
metavar="INDEX_FILE_OUT")

parser.add_argument('-id2n', '--index_to_node_file_out', action='store', dest='index_to_node_file_out',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the secondary options should only have -, instead of underscore.

--index_to_node_file_out should be --index-to-node-file-out.

Also as a general rule, do not use stop words like to in the parameter name, This review comment applies to all the subsequent parameters.

The dest parameter is fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason to avoid stop words simply for brevity? I am struggling to think of a name for this that omits 'to' and isn't ambiguous. This parameter specifies the path where a kgtk file will be saved. The kgtk file contains a mapping of index to corresponding node. Do you have a suggestion for this parameter name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain a bit more as to what exactly this file contains? mapping of what index ?how is this different from the output file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you use a faiss index to search for nearest neighbors, it returns distances and corresponding indexes/IDs of the nearest neighbors, rather than the names or embeddings of the nearest neighbors. This file would allow you to look up the entity name that corresponds to the index/ID. Here's an example of what the file would look like:

Input embedding file:

q1 <embedding>
q2 <embedding>
q3 <embedding>

output index_to_node file:
node1 label node2
0 index_to_node q1
1 index_to_node q2
2 index_to_node q3

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I could use the word ID rather than index here to avoid confusion. Then I could call this 'node_id_file_out' to avoid 'to'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, using the stop word to makes sense here. I would call this --faiss-id-to-node-mapping-file As you can, brevity is not the issue, meaningful name is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

except SystemExit as e:
raise KGTKException("Exit requested")
except KGTKException as e:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise the caught exception

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"""
Train and populate a faiss index that can compute nearest neighbors of given embeddings.
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename the command and the file in cli to build-faiss-index`

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def build_faiss(embeddings_file, embeddings_format, no_input_header, index_file_out, index_to_node_file_out,
max_train_examples, workers, index_string, metric_type, p=None, verbose=False):

# validate input file path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use KGTKReader to read the input file as it is an edge file. It handles a lot of exceptions and cases. Take a look at the cat command for example.


# validate metric type and translate to a faiss metric
metrics = {
"Inner_product": faiss.METRIC_INNER_PRODUCT,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure this is case insensitive

else:
raise KGTKException("Unrecognized value for metric_type parameter: {}.".format(metric_type) +
"Please choose one of {}.".format(" | ".join(list(metrics.keys()))))
if metric_type == "Lp" and p is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is p? Use a descriptive name please

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Lp distance metric is specified, then the user also needs to specify the value of p that they want to use, e.g. L1, L2, L...
I'll change it to metric_arg though since that is what Faiss calls it.

print("Writing index-to-node file...")
with open(embeddings_file, 'r') as f_in:
with open(index_to_node_file_out, 'w+') as f_out:
f_out.write("node1\tlabel\tnode2\n") # header
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use KGTKWriter instead

# Load training examples for index training
train_vecs = []
if verbose:
print("Loading training vectors...")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print to an error_file, see example in cat. No printing to standard out

@saggu
Copy link
Member

saggu commented Mar 1, 2022

@nicklein ETA on the requested changes?

@nicklein
Copy link
Collaborator Author

nicklein commented Mar 3, 2022

@nicklein ETA on the requested changes?

Hey @saggu, sorry about the slowness on this, it ended up becoming lower priority amongst my list of todos. I've talked with Filip/Pedro about this and I will try to get the changes wrapped up soon, either end of this week, or beginning of next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants