赤琦 / MSMARCO-Passage-Ranking

get_all_passages.py 1.23 KB
Your Name committed on 2019-05-16 17:08: adding files
import sys

import pandas as pd


def main(train_filename, dev_filename, eval_filename, output_filename):
    # Load the three splits; each row carries a 'passages' list of dicts.
    train = pd.read_json(train_filename)
    dev = pd.read_json(dev_filename)
    eval_df = pd.read_json(eval_filename)  # renamed to avoid shadowing the built-in eval
    passages = {}  # passage text -> integer passage id
    pid = 0
    # iterrows() yields (index, row) pairs, so row[1] is the row itself.
    for row in train.iterrows():
        for passage in row[1]['passages']:
            if passage['passage_text'] not in passages:
                passages[passage['passage_text']] = pid
                pid += 1
    for row in dev.iterrows():
        for passage in row[1]['passages']:
            if passage['passage_text'] not in passages:
                passages[passage['passage_text']] = pid
                pid += 1
    for row in eval_df.iterrows():
        for passage in row[1]['passages']:
            if passage['passage_text'] not in passages:
                passages[passage['passage_text']] = pid
                pid += 1
    # Write one "<passage_text>\t<pid>" line per unique passage.
    with open(output_filename, 'w') as w:
        for passage in passages:
            w.write("{}\t{}\n".format(passage, passages[passage]))
    # pid is incremented once per new passage, so it already equals the count.
    print("{} unique passages found".format(pid))


if __name__ == '__main__':
    if len(sys.argv) != 5:
        print("Usage: get_all_passages.py <train_file> <dev_file> <eval_file> <output_filename>")
        sys.exit(-1)
    else:
        main(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
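
The deduplication step can be tried without the dataset files. The following is a minimal sketch, not part of the original script: the helper name `assign_passage_ids` and the toy records are hypothetical, and it assumes each record has a `passages` list of dicts with a `passage_text` key, as the script above does.

```python
def assign_passage_ids(record_groups):
    """Map each distinct passage text to an id, in first-seen order."""
    passages = {}
    for records in record_groups:
        for record in records:
            for passage in record["passages"]:
                text = passage["passage_text"]
                if text not in passages:
                    # The next free id equals the number of passages seen so far.
                    passages[text] = len(passages)
    return passages


# Toy stand-ins for the train and dev splits (hypothetical data).
train = [{"passages": [{"passage_text": "a"}, {"passage_text": "b"}]}]
dev = [{"passages": [{"passage_text": "b"}, {"passage_text": "c"}]}]

ids = assign_passage_ids([train, dev])
print("{} unique passages found".format(len(ids)))
```

Using `len(passages)` as the next id keeps the counter and the mapping in sync, which avoids an off-by-one when reporting the total.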