The JPO Patent Corpus was constructed by the Japan Patent Office (JPO). It consists of Chinese-Japanese, Korean-Japanese, and English-Japanese patent description corpora, each with 1M parallel training sentences.
Most sentences in the corpus come from four International Patent Classification (IPC) sections:
Chemistry (Ch), Electricity (El), Mechanical engineering (Me), and Physics (Ph).
Unlike the previous patent tasks at WAT2018-2021, the ko-ja test-N2 set has been removed due to a technical problem, new test-N4 sets have been added, and test-2022 sets replace the previous test-N sets.
Corpus statistics:
Language Pair | Data Type | File Name | Size | Sections:Ratios | Published Years | Sentence Alignment |
---|---|---|---|---|---|---|
ZH<-->JA | TRAIN | train.{zh,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | DEV | dev.{zh,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | DEVTEST | devtest.{zh,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | TEST | test-n1.{zh,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | TEST | test-n2.{zh,ja} | 3,000 | Ch/El/Me/Ph:Unknown | 2016-2017 | Automatic |
ZH<-->JA | TEST | test-n3.{zh,ja} | 204 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
ZH<-->JA | TEST | test-n4.{zh,ja} | 5,000 | Uncontrolled | 2019-2020 | Automatic |
ZH<-->JA | TEST | test-2022.{zh,ja} | 10,204 | Uncontrolled | 2011-2020 | Automatic/Manual |
KO<-->JA | TRAIN | train.{ko,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | DEV | dev.{ko,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | DEVTEST | devtest.{ko,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | TEST | test-n1.{ko,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | TEST | test-n2.{ko,ja} | 0 | N/A | N/A | N/A |
KO<-->JA | TEST | test-n3.{ko,ja} | 230 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
KO<-->JA | TEST | test-n4.{ko,ja} | 5,000 | Uncontrolled | 2019-2020 | Automatic |
KO<-->JA | TEST | test-2022.{ko,ja} | 7,230 | Uncontrolled | 2011-2020 | Automatic/Manual |
EN<-->JA | TRAIN | train.{en,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | DEV | dev.{en,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | DEVTEST | devtest.{en,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | TEST | test-n1.{en,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | TEST | test-n2.{en,ja} | 3,000 | Ch/El/Me/Ph:Unknown | 2016-2017 | Automatic |
EN<-->JA | TEST | test-n3.{en,ja} | 668 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
EN<-->JA | TEST | test-n4.{en,ja} | 5,000 | Uncontrolled | 2019-2020 | Automatic |
EN<-->JA | TEST | test-2022.{en,ja} | 10,668 | Uncontrolled | 2011-2020 | Automatic/Manual |
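
The test-set sizes in the table are consistent with each test-2022 set being the concatenation of the test-n1 to test-n4 sets for the same language pair. The page does not state this explicitly, so treat it as an inference from the numbers. The short Python sketch below checks the arithmetic and also expands the `name.{xx,ja}` file-name notation into the two file names it stands for. The names `TEST_SIZES`, `expand_pair`, and `sum_of_parts` are illustrative and not part of the corpus distribution.

```python
# Test-set sizes (sentence pairs) copied from the corpus statistics table.
TEST_SIZES = {
    "zh-ja": {"test-n1": 2000, "test-n2": 3000, "test-n3": 204,
              "test-n4": 5000, "test-2022": 10204},
    "ko-ja": {"test-n1": 2000, "test-n2": 0, "test-n3": 230,
              "test-n4": 5000, "test-2022": 7230},
    "en-ja": {"test-n1": 2000, "test-n2": 3000, "test-n3": 668,
              "test-n4": 5000, "test-2022": 10668},
}


def expand_pair(name):
    """Expand shell-style braces: 'train.{zh,ja}' -> ['train.zh', 'train.ja']."""
    stem, _, rest = name.partition("{")
    extensions = rest.rstrip("}").split(",")
    return [stem + ext for ext in extensions]


def sum_of_parts(sizes):
    """Add up every test set except test-2022 itself."""
    return sum(size for name, size in sizes.items() if name != "test-2022")


for pair, sizes in TEST_SIZES.items():
    total = sum_of_parts(sizes)
    # Prints the pair, the summed size, and whether it matches test-2022.
    print(pair, total, total == sizes["test-2022"])

print(expand_pair("test-2022.{ko,ja}"))
```

For all three language pairs the summed size matches the listed test-2022 size (10,204, 7,230, and 10,668), with the ko-ja total reflecting the removed test-n2 set.
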
For questions or comments, please email "wat-organizer -at- googlegroups -dot- com".
2023-04-20: "HOW TO OBTAIN" was updated.
2022-05-18: "HOW TO OBTAIN" was updated.
2022-04-27: Training data size shown in corpus statistics was modified.
2021-03-30: Site opened.
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2023-04-20