Customizing your workflow

In addition to the built-in features of BlockchainSpider, you can customize the workflow of a spider by configuring its pipelines.

Note: this page assumes that you already have an understanding of pipelines in Scrapy and of the transaction semantic representation technique.
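
For reference, a Scrapy item pipeline is a plain class exposing a process_item(self, item, spider) method: Scrapy calls it for every item yielded by the spider, and the method returns the item so that later pipelines can still process it. A minimal sketch:

class MinimalPipeline:
    def process_item(self, item, spider):
        # inspect or transform the item here,
        # then pass it on to the next pipeline
        return item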

Taking the transaction spider as an example, we discuss how to calculate transaction semantic vectors while crawling transactions in block order.

First, you need to define your own pipeline (see MoTSPipeline) that processes the continuously synchronized block data:

import csv

from BlockchainSpider.items import SyncItem, TransactionItem, TraceItem, \
    Token721TransferItem, Token20TransferItem, Token1155TransferItem
# NOTE: adjust this import to the module where HighOrderMotifCounter
# is defined in your project
from contrib.mots.counter import HighOrderMotifCounter


class MoTSPipeline:
    def __init__(self):
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # the output path here is an example; change it as needed
        self.file = open('./mots_vectors.csv', 'w', newline='')
        self.writer = csv.writer(self.file)

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        if self.file is None:
            return item
        if not isinstance(item, SyncItem):
            return item

        # collect money transfer items
        # the 'data' field in SyncItem is a dict,
        # where keys are parsed item class names,
        # and values are lists of parsed items;
        # all the items in a SyncItem are parsed from the same block
        txhash2edges = dict()
        transfer_type_names = [
            cls.__name__ for cls in [
                TransactionItem, TraceItem,
                Token721TransferItem, Token20TransferItem,
                Token1155TransferItem,
            ]
        ]
        for name in transfer_type_names:
            if not item['data'].get(name):
                continue
            for transfer_item in item['data'][name]:
                txhash = transfer_item['transaction_hash']
                if txhash not in txhash2edges:
                    txhash2edges[txhash] = list()
                txhash2edges[txhash].append({
                    'address_from': transfer_item['address_from'],
                    'address_to': transfer_item['address_to'],
                })

        # calculate a semantic vector for each transaction
        txhashes, vecs = list(), list()
        for txhash, edges in txhash2edges.items():
            vec = HighOrderMotifCounter(motif_size=4).count(edges)
            txhashes.append(txhash)
            vecs.append(vec)

        # write one row per transaction: the transaction hash
        # followed by the counts of the 16 motif types
        for txhash, vec in zip(txhashes, vecs):
            vec_list = [vec[i] for i in range(1, 16 + 1)]
            self.writer.writerow([txhash, *vec_list])
        return item
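
To make the data flow concrete, the txhash2edges mapping built above has the following shape (the hashes and addresses here are purely illustrative):

txhash2edges = {
    '0xabc...': [  # all money transfers of one transaction
        {'address_from': '0x1111...', 'address_to': '0x2222...'},
        {'address_from': '0x2222...', 'address_to': '0x3333...'},
    ],
    '0xdef...': [
        {'address_from': '0x4444...', 'address_to': '0x5555...'},
    ],
}

Each edge list is the money transfer graph of a single transaction, and HighOrderMotifCounter(motif_size=4).count(...) maps it to a vector indexed by the motif types 1 to 16, as used in the pipeline above.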

Next, enable the pipeline in the settings:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # use the module path where you defined MoTSPipeline
    'contrib.mots.pipelines.MoTSPipeline': 500,
}
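
The integer value assigned to the pipeline (from 0 to 1000) determines the order in which items pass through the enabled pipelines; items flow through pipelines with lower values first.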

Finally, the following command starts the transaction spider, which calculates and saves the semantic vector of each transaction while crawling the transaction data:

scrapy crawl trans.block.evm \
-a out=/path/to/output/data \
-a start_blk=19000000 -a end_blk=19001000 \
-a providers=https://freerpc.merkle.io \
-a enable=BlockchainSpider.middlewares.trans.TransactionReceiptMiddleware,BlockchainSpider.middlewares.trans.TraceMiddleware,BlockchainSpider.middlewares.trans.TokenTransferMiddleware
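
The enable argument loads the middlewares that enrich each synchronized block with receipts, internal traces, and token transfers, so that the corresponding items actually appear in the SyncItem handled by the pipeline. After the crawl finishes, the saved vectors can be loaded back for downstream analysis; a minimal sketch, assuming the pipeline wrote to ./mots_vectors.csv as in the example above:

import csv

with open('./mots_vectors.csv', 'r') as f:
    for row in csv.reader(f):
        txhash, vec = row[0], [float(x) for x in row[1:]]
        print(txhash, vec)  # vec holds the 16 motif counts of this transaction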
