Skip to content

PyTorch Huggingface BERT-NLP for Named Entity Recognition #328

Closed
@AshwinAmbal

Description

@AshwinAmbal

I have been using your PyTorch implementation of Google’s BERT by HuggingFace for the MADE 1.0 dataset for quite some time now. Up until last time (11-Feb), I had been using the library and getting an F-Score of 0.81 for my Named Entity Recognition task by Fine Tuning the model. But this week when I ran the exact same code which had compiled and run earlier, it threw an error when executing this statement:

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype=”long”, truncating=”post”, padding=”post”)

ValueError: Token indices sequence length is longer than the specified
maximum sequence length for this BERT model (632 > 512). Running this
sequence through BERT will result in indexing errors

The full code is available in this colab notebook.

To get around this error I modified the above statement to the one below by taking the first 512 tokens of any sequence and made the necessary changes to add the index of [SEP] to the end of the truncated/padded sequence as required by BERT.

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype=”long”, truncating=”post”, padding=”post”)

The result shouldn’t have changed because I am only considering the first 512 tokens in the sequence and later truncating to 75 as my (MAX_LEN=75) but my F-Score has dropped to 0.40 and my precision to 0.27 while the Recall remains the same (0.85). I am unable to share the dataset as I have signed a confidentiality clause but I can assure all the preprocessing as required by BERT has been done and all extended tokens like (Johanson –> Johan ##son) have been tagged with X and replaced later after the prediction as said in the BERT Paper.

Has anyone else faced a similar issue or can elaborate on what might be the issue or what changes the PyTorch (Huggingface) has done on their end recently?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions