
Customizing your workflow

In addition to the features provided by BlockchainSpider, you can customize the spider's workflow by configuring an item pipeline.
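In Scrapy, an item pipeline is any class that implements a process_item method; Scrapy calls it for every item the spider yields. The minimal sketch below (the class name is illustrative) shows the three hooks a custom pipeline can implement:

class MyPipeline:
    def open_spider(self, spider):
        # called once when the spider opens; acquire resources here
        pass

    def close_spider(self, spider):
        # called once when the spider closes; release resources here
        pass

    def process_item(self, item, spider):
        # called for each yielded item; return it to pass it on
        # to the next pipeline, or raise DropItem to discard it
        return item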

Note: this page assumes that you already have an understanding of item pipelines in Scrapy, and of the transaction semantic representation technique (MoTS).

Taking the transaction spider as an example, we discuss how to calculate transaction semantic vectors while crawling transactions in block order.

First, you need to define your own pipeline (see MoTSPipeline below) that can process the continuously synchronized block data:

import csv

from BlockchainSpider.items import SyncItem, TransactionItem, TraceItem, \
    Token721TransferItem, Token20TransferItem, Token1155TransferItem
# the import path of HighOrderMotifCounter is an assumption;
# adjust it to wherever the motif counter is defined in your project
from contrib.mots.counter import HighOrderMotifCounter


class MoTSPipeline:
    def __init__(self):
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # the output filename is an example; change it as needed
        self.file = open('mots_vectors.csv', 'w', newline='')
        self.writer = csv.writer(self.file)

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        if self.file is None:
            return item
        if not isinstance(item, SyncItem):
            return item

        # collect money transfer items.
        # the 'data' field in SyncItem is a dict,
        # where keys are parsed item class names,
        # and values are lists of parsed items.
        # all the items in a SyncItem are parsed from the same block.
        txhash2edges = dict()
        transfer_type_names = [
            cls.__name__ for cls in [
                TransactionItem, TraceItem,
                Token721TransferItem, Token20TransferItem,
                Token1155TransferItem,
            ]
        ]
        for name in transfer_type_names:
            if not item['data'].get(name):
                continue
            for transfer_item in item['data'][name]:
                txhash = transfer_item['transaction_hash']
                if not txhash2edges.get(txhash):
                    txhash2edges[txhash] = list()
                txhash2edges[txhash].append({
                    'address_from': transfer_item['address_from'],
                    'address_to': transfer_item['address_to'],
                })

        # count size-4 motifs over the money transfer edges of each
        # transaction, yielding a 16-dimensional semantic vector,
        # and write one CSV row per transaction
        for txhash, edges in txhash2edges.items():
            vec = HighOrderMotifCounter(motif_size=4).count(edges)
            vec_list = [vec[i] for i in range(1, 16 + 1)]
            self.writer.writerow([txhash, *vec_list])
        return item
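Each SyncItem therefore yields one CSV row per transaction in the block: the transaction hash followed by its 16 motif counts.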

Next, enable the pipeline in the settings:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# (the module path below should point to wherever MoTSPipeline is defined)
ITEM_PIPELINES = {
   'contrib.mots.pipelines.MoTSPipeline': 500,
}
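The integer value (conventionally in the 0-1000 range) determines the order in which pipelines run: items pass through lower-valued pipelines first, so you can place MoTSPipeline before or after your other pipelines as needed.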

Finally, the following command starts the transaction spider, which calculates and saves the semantic vector of each transaction while crawling transaction data:

scrapy crawl trans.block.evm \
-a out=/path/to/output/data \
-a start_blk=19000000 -a end_blk=19001000 \
-a providers=https://freerpc.merkle.io \
-a enable=BlockchainSpider.middlewares.trans.TransactionReceiptMiddleware,BlockchainSpider.middlewares.trans.TraceMiddleware,BlockchainSpider.middlewares.trans.TokenTransferMiddleware
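The enable argument loads the optional receipt, trace, and token transfer middlewares, so that the trace and token transfer items expected by the pipeline above are collected alongside plain transactions.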