Matching HS Codes in 2024: Traversing the Customs Space

HS (Harmonised System) Codes help customs authorities and businesses around the world classify traded goods, so that the appropriate import/export controls can be applied. The system classifies a product hierarchically, placing it into progressively finer categories.

Figure 1: A simple exercise of coding a RAM module.

Could we do this differently with language tools such as LLMs in 2024? This blog examines different approaches to classification.

Methodology of Classification

Assigning an HS Code is done hierarchically, starting from Sections, then Chapters, then finer granularities. If we consider this as a search problem, the search domain keeps shrinking as we continue the “process of selection”. For example, as soon as we classify a product as an electrical component, we no longer need to keep Wines in the search space.

A graph could be a great solution to this problem!

Resources at our Disposal

We are focusing on classifying according to the definitions set by Singapore Customs.

Constructing a Graph

STCCED (Singapore Trade Classification, Customs and Excise Duties), as the name suggests, contains the hierarchical classification of traded goods. It is published as a PDF file, and its text is divided into Sections, Chapters and “Subchapters”.

Step 1: ⬇️ Download the STCCED 2022 PDF and use PyPDF2 to extract the text content.

Step 1: Raw text content of STCCED

Step 2: ✂️ Split the text content recursively, into Sections, Chapters and “Subchapters”.

Step 2: The recursively split STCCED!
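A minimal sketch of how such a recursive split could work, assuming that headings appear on their own lines as "SECTION <roman numeral>" and "CHAPTER <number>". Both patterns and the sample text are assumptions for illustration; the real extracted text may need different patterns, plus a third level for the “Subchapters”.

```python
import re

# Assumed heading patterns; the real STCCED layout may differ.
SECTION_RE = re.compile(r"^SECTION\s+[IVXLC]+\b.*$", re.MULTILINE)
CHAPTER_RE = re.compile(r"^CHAPTER\s+\d+\b.*$", re.MULTILINE)


def split_by_heading(text, pattern):
    """Split text into {heading: body} at every line matching pattern."""
    matches = list(pattern.finditer(text))
    parts = {}
    for i, match in enumerate(matches):
        # A body runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts[match.group(0).strip()] = text[match.end():end].strip()
    return parts


def split_document(text):
    """Split into Sections first, then split each Section into Chapters."""
    return {
        section: split_by_heading(body, CHAPTER_RE)
        for section, body in split_by_heading(text, SECTION_RE).items()
    }


# Tiny illustrative sample, not the actual STCCED text.
sample = """SECTION I
CHAPTER 1
Live animals
CHAPTER 2
Meat and edible meat offal
SECTION II
CHAPTER 6
Live trees and other plants
"""
tree = split_document(sample)
```

The same `split_by_heading` helper can be applied again to each chapter body with a finer pattern, which is what makes the split recursive.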

Step 3: Convert all the subchapters into GraphViz dot files, a compact representation of parent-child relationships. I used gpt-4 to read the text and construct these subgraphs.

Super-interesting prompt that converted chapter text into GraphViz
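To give a feel for the format, here is a hedged sketch of what a small dot subgraph might look like and how its parent-child edges could be read back into Python. The node labels are abbreviated for illustration, and `parse_dot_edges` is a hypothetical helper that only handles quoted node names; it is not a full dot parser.

```python
import re

# Matches edges written as "parent" -> "child" (quoted node names only).
EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')


def parse_dot_edges(dot_text):
    """Return {parent: [children]} from the quoted edges in a dot string."""
    children = {}
    for parent, child in EDGE_RE.findall(dot_text):
        children.setdefault(parent, []).append(child)
    return children


# Illustrative subgraph; labels are abbreviated, not the exact STCCED wording.
dot_text = '''
digraph chapter_85 {
    "Chapter 85" -> "8542 Electronic integrated circuits";
    "Chapter 85" -> "8544 Insulated wire and cable";
    "8542 Electronic integrated circuits" -> "8542.32 Memories";
}
'''
subgraph = parse_dot_edges(dot_text)
```

Each `->` line is one parent-child relationship, so a whole subchapter fits in a few short lines of text.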

Step 4: Merge the subgraphs hierarchically. The result is a graph of 13K+ nodes, neatly organised!

Zooming into the STCCED 2022 Graph!
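One way the merge could be sketched, assuming every subgraph is available as a plain `{parent: [children]}` mapping and that a child in one subgraph reuses the exact label of its parent entry in another. The function names and the tiny example graphs are hypothetical.

```python
def merge_subgraphs(subgraphs):
    """Union several {parent: [children]} maps, keeping order and dropping duplicates."""
    merged = {}
    for graph in subgraphs:
        for parent, children in graph.items():
            bucket = merged.setdefault(parent, [])
            for child in children:
                if child not in bucket:
                    bucket.append(child)
    return merged


def count_nodes(graph):
    """Count distinct nodes, whether they appear as a parent or as a child."""
    nodes = set(graph)
    for children in graph.values():
        nodes.update(children)
    return len(nodes)


# Illustrative pieces: one top-level graph plus two chapter subgraphs.
top = {"root": ["Section XVI"], "Section XVI": ["Chapter 84", "Chapter 85"]}
ch84 = {"Chapter 84": ["8471 Automatic data-processing machines"]}
ch85 = {"Chapter 85": ["8542 Electronic integrated circuits"]}

graph = merge_subgraphs([top, ch84, ch85])
total = count_nodes(graph)
```

Because the chapter labels match between the top-level graph and the chapter subgraphs, the pieces link up into a single tree hanging off `root`.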

At each granularity level (Section, Chapter, Subchapter, Dash and Double Dash), I sent a request directly to gpt-4 with the list of child nodes of the current node, and moved on to the child it selected. The illustration below shows the idea.

Illustration of traversing root -> Section -> Chapter…

Thus, each step down in granularity shrinks the remaining search space.
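The traversal can be sketched roughly as follows. This is a sketch under stated assumptions, not the app's actual code: `classify` and `keyword_choose` are hypothetical names, the graph is a toy, and `keyword_choose` is a simple word-overlap stand-in for the gpt-4 request so that the example runs offline.

```python
def classify(graph, description, choose, root="root"):
    """Walk from root to a leaf, asking `choose` to pick one child per level."""
    path = [root]
    node = root
    while graph.get(node):
        options = graph[node]
        node = choose(description, options)
        if node not in options:
            raise ValueError(f"chooser returned an unknown option: {node!r}")
        path.append(node)
    return path


def keyword_choose(description, options):
    """Stand-in for the LLM call: pick the option sharing the most words."""
    words = set(description.lower().split())
    return max(options, key=lambda opt: len(words & set(opt.lower().split())))


# Toy graph with abbreviated labels, for illustration only.
graph = {
    "root": ["Beverages and spirits", "Electrical machinery and equipment"],
    "Electrical machinery and equipment": [
        "Electronic integrated circuits",
        "Insulated wire and cable",
    ],
    "Electronic integrated circuits": ["Memories", "Amplifiers"],
}
path = classify(graph, "electronic memories for electrical equipment", keyword_choose)
```

Only the children of the current node are ever shown to the chooser, so once the walk enters the electrical branch, the beverages branch is never looked at again.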

Giving it some LLM and Streamlit

For the first build, I called openai directly and used streamlit to build the user interface.

Try it now on Streamlit:

Demo: 🌏
GitHub: 🔗

Limitations and Future Work

  1. The current implementation does not go backwards to course-correct. Thus, a bad classification at a coarser level carries through to every finer level below it.

  2. More tokens could be saved by combining vector search with the graph for nodes that have a large number of children.
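As a rough illustration of the second point, the children of a large node could be shortlisted by similarity before any of them are sent to the LLM. The sketch below is an assumption about how that might look: it uses a toy bag-of-words vector in place of a real embedding model, and `embed`, `cosine` and `shortlist` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real build would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def shortlist(description, children, k=3):
    """Keep only the k children most similar to the description."""
    query = embed(description)
    ranked = sorted(children, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


children = [
    "Wine of fresh grapes",
    "Electronic integrated circuits",
    "Insulated wire and cable",
    "Live animals",
]
candidates = shortlist("electronic circuits memory", children, k=1)
```

Only the shortlisted candidates would then go into the prompt, which is where the token saving would come from.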