[Python] Fetching compound data by CID (using the official PubChem API)






Introduction

Fetch compound data from PubChem by CID, using the PubChem PUG REST API. Data for more than a thousand CIDs can be retrieved in 2 to 3 seconds.



Download

The program has been packaged as an executable that can be used right after download:

pubchem-1.0.2-win64.zip






Non-developers can simply download and use the packaged program and need not read past this point. If you run into problems, please contact me.




Installation

pip install requests



Usage

  1. Clone the repository.
git clone https://github.com/XavierJiezou/python-pubchem-api.git

  2. Change into the repository root.
cd python-pubchem-api

  3. Copy your CID list into cid.txt, one CID per line.
  4. Run python pubchem.py.
  5. The results are saved to data.json and data.csv.
  6. Optionally, edit the variable self.property_list in pubchem.py, choosing from the compound property table below:
self.property_list = [
    'IUPACName',
    'IsomericSMILES',
    'MolecularFormula',
    'MolecularWeight',
    'HBondDonorCount',
    'HBondAcceptorCount'
]
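Because every line of cid.txt ends up in a request URL, it can help to check the file before running the script. The sketch below is not part of the repository: the helper name load_cids is hypothetical, and it assumes one CID per line, treating anything that is not a positive integer as invalid.

```python
from pathlib import Path


def load_cids(path):
    """Return (valid, rejected) CID strings read from a one-per-line text file.

    Hypothetical helper for illustration; pubchem.py does not contain it.
    """
    valid, rejected = [], []
    seen = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        cid = raw.strip()
        if not cid:
            continue  # ignore blank lines
        if cid.isascii() and cid.isdigit() and int(cid) > 0:
            cid = str(int(cid))  # normalise, e.g. "007" becomes "7"
            if cid not in seen:  # drop duplicates, keep the first occurrence
                seen.add(cid)
                valid.append(cid)
        else:
            rejected.append(cid)
    return valid, rejected
```

Calling load_cids('cid.txt') before the crawl gives a clean list to query and a list of rejected lines to inspect by hand.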



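Related



Compound property table

Multiple properties can be requested by putting a comma-separated list of property tags in the URL. Valid output formats for the property table are XML, ASNT/B, JSON(P), CSV, and TXT (single property only). The available properties include:


Property Description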
MolecularFormula Molecular formula.
MolecularWeight The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location.
CanonicalSMILES Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a “canonicalization” algorithm.
IsomericSMILES Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications.
InChI Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string.
InChIKey Hashed version of the full standard InChI, consisting of 27 characters.
IUPACName Chemical name systematically determined according to the IUPAC nomenclatures.
Title The title used for the compound summary page.
XLogP Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule.
ExactMass The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum.
MonoisotopicMass The mass of a molecule, calculated using the mass of the most abundant isotope of each element.
TPSA Topological polar surface area, computed by the algorithm described in the paper by Ertl et al.
Complexity The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula.
Charge The total (or net) charge of a molecule.
HBondDonorCount Number of hydrogen-bond donors in the structure.
HBondAcceptorCount Number of hydrogen-bond acceptors in the structure.
RotatableBondCount Number of rotatable bonds.
HeavyAtomCount Number of non-hydrogen atoms.
IsotopeAtomCount Number of atoms with enriched isotope(s).
AtomStereoCount Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration].
DefinedAtomStereoCount Number of atoms with defined tetrahedral (sp3) stereo.
UndefinedAtomStereoCount Number of atoms with undefined tetrahedral (sp3) stereo.
BondStereoCount Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration].
DefinedBondStereoCount Number of bonds with defined planar (sp2) stereo.
UndefinedBondStereoCount Number of bonds with undefined planar (sp2) stereo.
CovalentUnitCount Number of covalently bound units.
Volume3D Analytic volume of the first diverse conformer (default conformer) for a compound.
XStericQuadrupole3D The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound.
YStericQuadrupole3D The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound.
ZStericQuadrupole3D The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound.
FeatureCount3D Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D).
FeatureAcceptorCount3D Number of hydrogen-bond acceptors of a conformer.
FeatureDonorCount3D Number of hydrogen-bond donors of a conformer.
FeatureAnionCount3D Number of anionic centers (at pH 7) of a conformer.
FeatureCationCount3D Number of cationic centers (at pH 7) of a conformer.
FeatureRingCount3D Number of rings of a conformer.
FeatureHydrophobeCount3D Number of hydrophobes of a conformer.
ConformerModelRMSD3D Conformer sampling RMSD in Å.
EffectiveRotorCount3D Effective rotor count, a measure of the conformational flexibility of a compound.
ConformerCount3D The number of conformers in the conformer model for a compound.
Fingerprint2D Base64-encoded PubChem Substructure Fingerprint of a molecule.
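The comma-separated URL pattern described before the table can be sketched roughly as follows. The function name build_property_url and its checks are assumptions made for this example and are not part of pubchem.py; the base URL is the one used in the source code later in this post. Only a subset of the listed output formats is handled here.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"

# Subset of the output formats named in the text; the JSONP and ASNT/B
# variants are deliberately left out of this sketch.
FORMATS = {"XML", "JSON", "CSV", "TXT"}


def build_property_url(cids, properties, fmt="JSON"):
    """Build a property-table URL for the given CIDs and property tags.

    Hypothetical helper for illustration only.
    """
    fmt = fmt.upper()
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format in this sketch: {fmt}")
    if not cids or not properties:
        raise ValueError("at least one CID and one property are required")
    if fmt == "TXT" and len(properties) != 1:
        # The text above states that TXT is limited to a single property.
        raise ValueError("TXT output supports a single property only")
    cid_str = ",".join(str(c) for c in cids)
    prop_str = ",".join(properties)
    return f"{BASE}{cid_str}/property/{prop_str}/{fmt}"


print(build_property_url([1, 2, 3], ["MolecularFormula", "TPSA"], "csv"))
```

Keeping URL construction in one small function makes it easy to switch the property list or the output format without touching the request code.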



Property API

Get properties by CID.



Example:



https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight/JSON
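A response to a request like this can be turned into a lookup table keyed by CID. The sketch below assumes the JSON layout that the source code in this post relies on (a PropertyTable object holding a Properties list, each entry carrying a CID field). The names fetch_json and index_by_cid are hypothetical, the sample payload is typed by hand for illustration rather than copied from a live response, and the standard-library urllib is used so the example runs without extra packages.

```python
import json
from urllib.request import urlopen

URL = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
       "1,2,3,4,5/property/MolecularFormula,MolecularWeight/JSON")


def fetch_json(url=URL, timeout=30):
    """Download and decode a JSON response (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def index_by_cid(payload):
    """Map each CID to its property dict, leaving out the CID key itself."""
    table = {}
    for entry in payload["PropertyTable"]["Properties"]:
        table[entry["CID"]] = {k: v for k, v in entry.items() if k != "CID"}
    return table


# Hand-written sample shaped like the expected response (aspirin, CID 2244).
sample = {"PropertyTable": {"Properties": [
    {"CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16"},
]}}
print(index_by_cid(sample))
```

With network access, index_by_cid(fetch_json()) would apply the same parsing to the live response for CIDs 1 to 5.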



Synonym API

Get synonyms by CID.



Example:



https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/synonyms/JSON
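The synonym response can be handled in a similar way. This sketch assumes the layout used by the source code in this post (an InformationList object holding an Information list, where an entry may lack the Synonym key when a compound has no synonyms). The function name synonyms_by_cid is hypothetical and the sample payload is hand-written.

```python
def synonyms_by_cid(payload):
    """Map each CID to its list of synonyms; entries without any get []."""
    return {
        info["CID"]: info.get("Synonym", [])
        for info in payload["InformationList"]["Information"]
    }


# Hand-written sample shaped like the expected response.
sample = {"InformationList": {"Information": [
    {"CID": 2244, "Synonym": ["aspirin", "acetylsalicylic acid"]},
    {"CID": 999999999},  # made-up entry without a 'Synonym' key
]}}

for cid, names in synonyms_by_cid(sample).items():
    print(cid, names[:1])  # show at most the first synonym per compound
```

Defaulting to an empty list keeps later code, such as the CSV export, from failing on compounds that have no synonyms.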



Packaging

git clone https://github.com/XavierJiezou/python-pubchem-api.git
cd python-pubchem-api
pip install pipenv
pipenv install
pipenv shell
pip install requests
pip install pyinstaller
pyinstaller -F -i favicon.ico pubchem.py



Source code


https://github.com/XavierJiezou/python-pubchem-api

import csv
import json
import os

import requests


class PubchemCrawlFast():
    def __init__(self, cid_path, out_path):
        """Initialization function.

        Args:
            cid_path (str): Input file path of cid list
            out_path (str): Output file path of crawled data 
        """
        self.cid_path = cid_path
        self.out_path = out_path
        self.property_list = [
            'IUPACName',
            'IsomericSMILES',
            'MolecularFormula',
            'MolecularWeight',
            'HBondDonorCount',
            'HBondAcceptorCount'
        ]

    def get_cid_list(self):
        """Get the cid list from the local file
        """
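        if os.path.exists(self.cid_path):
            with open(self.cid_path) as f:
                # Skip blank lines so they do not end up as empty CIDs in the URL
                self.cid_list = [i.strip() for i in f if i.strip()]
        else:
            # No input file: read CIDs from the console until an empty line
            self.cid_list = []
            cid = input('Please input the CID list below: \n').strip()
            while cid != '':
                self.cid_list.append(cid)
                cid = input().strip()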
        self.length = len(self.cid_list)

    def get_property_from_cid(self):
        """Get the property from cid
        """
        limit = 300
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        property_str = ','.join(self.property_list)
        return_type = 'json'
        self.prp = []
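        # Query in batches of `limit` CIDs so that each URL stays reasonably short
        for i in range(limit, self.length+limit, limit):
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/property/{property_str}/{return_type}'
            res = requests.get(url, timeout=30)  # 30 s is an arbitrary choice
            res.raise_for_status()  # fail early on HTTP errors
            self.prp += res.json()['PropertyTable']['Properties']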

    def get_synonyms_from_cid(self):
        """Get the synonym from cid
        """
        limit = 300
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        return_type = 'json'
        self.syn = []
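        # Query in batches of `limit` CIDs so that each URL stays reasonably short
        for i in range(limit, self.length+limit, limit):
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/synonyms/{return_type}'
            res = requests.get(url, timeout=30)  # 30 s is an arbitrary choice
            res.raise_for_status()  # fail early on HTTP errors
            self.syn += res.json()['InformationList']['Information']
        # Compounds without synonyms lack the 'Synonym' key; default to an empty list
        for item in self.syn:
            item.setdefault('Synonym', [])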

    def save_as_csv(self, data):
        """Save the crawled data in CSV format
        """
        # splitext keeps working when the path contains other dots
        csv_name = os.path.splitext(self.out_path)[0]+'.csv'
        header_list = ['CID']+self.property_list+['Synonym']
        # Explicit UTF-8 so non-ASCII synonyms do not fail on Windows code pages
        with open(csv_name, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, header_list)
            writer.writeheader()
            writer.writerows(data)

    def __main__(self):
        print('Getting CID list: ')
        self.get_cid_list()
        print('CID list acquisition is complete!')
        print('--------------------------------------------')
        print('Querying property list: ')
        self.get_property_from_cid()
        print('Property list query is complete!')
        print('--------------------------------------------')
        print('Querying synonym: ')
        self.get_synonyms_from_cid()
        print('Synonym query is complete!')
        print('--------------------------------------------')
        dt = {
            'InfoList': {
                'Info': [dict(d1, **d2) for d1, d2 in zip(self.prp, self.syn)]
            }
        }
        json_str = json.dumps(dt, indent=2)
        print('The data is being written to the JSON file: ')
        with open(self.out_path, 'w') as f:
            f.write(json_str)
        print('Finished writing the JSON file! ')
        print('--------------------------------------------')
        print('The data is being written to the CSV file: ')
        self.save_as_csv(dt['InfoList']['Info'])
        print('Finished writing the CSV file! ')
        # Keep the console window open when the packaged program runs on Windows
        if os.name == 'nt':
            os.system('pause')


if __name__ == '__main__':
    PubchemCrawlFast('cid.txt', 'data.json').__main__()
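The two query methods above share the same batching loop, and the final merge pairs property and synonym records by position with zip. As a possible refinement (a sketch only, with the hypothetical helper names chunked and merge_by_cid), the batching can be factored out and the records joined on their CID field, so that a difference in ordering or a missing record would not silently pair the wrong rows.

```python
def chunked(items, size=300):
    """Yield consecutive slices of at most `size` items.

    The default of 300 mirrors the `limit` used in the script above.
    """
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_by_cid(properties, synonyms):
    """Join property and synonym records on 'CID', keeping the property order."""
    syn_lookup = {s["CID"]: s.get("Synonym", []) for s in synonyms}
    return [dict(p, Synonym=syn_lookup.get(p["CID"], [])) for p in properties]


# Toy records (placeholder values) to show the join; the synonym list is
# deliberately in a different order and one entry has no synonyms.
props = [{"CID": 1, "Title": "first"}, {"CID": 2, "Title": "second"}]
syns = [{"CID": 2, "Synonym": ["b"]}, {"CID": 1}]
print(merge_by_cid(props, syns))
```

In the script, merge_by_cid(self.prp, self.syn) could stand in for the zip-based list comprehension, and chunked(self.cid_list) for the range arithmetic in both query methods.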



References


https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest



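Copyright notice: This is an original article by qq_42951560, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.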