c++ - Access Violation while using _tcstok -
I'm trying to tokenize lines in a file using _tcstok. I'm able to tokenize the line once, but when I try to tokenize it a second time, I get an access violation. I sense it has something to do with not actually accessing the values, but their locations instead. I'm not sure how else to do it, though.
Thanks,
Dave
P.S. I'm using TCHAR and _tcstok because the file is UTF-8.
This is the error I'm getting:
First-chance exception at 0x63e866b4 (msvcr90d.dll) in testing.exe: 0xC0000005: Access violation reading location 0x0000006c.
vector<TCHAR> tabdelimitedsource::getnext() {
    // returns the next document (a given cell) from the file(s)
    TCHAR row[256];
    // return null if no more documents/rows
    vector<TCHAR> document;
    try {
        // read each line in the file, corresponding to an individual document
        buff_reader->getline(row, 10000);
    }
    catch (ifstream::failure e) {
        ; // ignore and fall through
    }
    if (_tcslen(row) > 0) {
        this->current_row += 1;
        vector<TCHAR> cells;
        // separate the line on tabs (id 'tab' document title 'tab' document body)
        TCHAR * pch;
        pch = _tcstok(row, "\t");
        while (pch != NULL) {
            cells.push_back(*pch);
            pch = _tcstok(NULL, "\t");
        }
        // split the cell into individual words using the lucene analyzer
        try {
            // separate the body by spaces
            TCHAR original_document;
            original_document = (cells[column_holding_doc]);
            try {
                TCHAR * pc;
                pc = _tcstok((char*)original_document, " ");
                while (pch != NULL) {
                    document.push_back(*pc);
                    pc = _tcstok(NULL, "\t");
                }
First up, your code is a mongrel mixture of C string manipulation and C++ containers. This will just dig you into a hole. Ideally, you should tokenize the line into a std::vector<std::wstring>.
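As a rough sketch of what tokenizing into a std::vector<std::wstring> could look like: split_fields is a made-up helper name, not something from the code in the question, and it assumes the line is already held in a std::wstring. Unlike _tcstok, it keeps empty fields, so column indexes stay stable when a cell is blank.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the original code): splits `line` on
// `delim` and returns every field as its own std::wstring.
// Empty fields are kept, which differs from strtok-style tokenizers.
std::vector<std::wstring> split_fields(const std::wstring& line,
                                       wchar_t delim = L'\t') {
    std::vector<std::wstring> fields;
    std::wstring::size_type start = 0;
    while (true) {
        const std::wstring::size_type end = line.find(delim, start);
        if (end == std::wstring::npos) {
            // No more delimiters: the remainder is the last field.
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;  // continue just past the delimiter
    }
    return fields;
}
```

Each element owns its own copy of the text, so nothing points back into a temporary buffer and the cells can be split again (for example on spaces) with the same function.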
Also, you're confused about TCHAR and UTF-8. TCHAR is a character type that 'floats' between 8 and 16 bits depending on compile-time flags. UTF-8 files use between 1 and 4 bytes to represent each character. So, you probably want to hold the text in std::wstring objects, but you're going to need to explicitly convert the UTF-8 into wstrings.
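To illustrate what "explicitly convert" means, here is a minimal hand-rolled decoder. utf8_to_wstring is a hypothetical name, the function assumes reasonably well-formed input (it does not reject overlong encodings or stray surrogates), and on Windows you would more likely call an existing API such as MultiByteToWideChar with CP_UTF8 rather than write this yourself.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into a std::wstring.
// Only basic structural checks are made; this is a sketch, not a validator.
std::wstring utf8_to_wstring(const std::string& utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t (e.g. Windows): encode as a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The point is that one character in the file can be several bytes, so the bytes have to be decoded before they are stored one-per-element in a wide string.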
But, if you just want to get anything working, focus on your tokenization. You need to store the address of the start of each token (as a TCHAR*), but your vector is a vector of TCHARs instead. When you try to use the token data, you're casting TCHARs to TCHAR* pointers, with the unsurprising result of access violations. The AV address you give is 0x0000006c, which is the ASCII code for the character l.
vector<TCHAR*> cells;
...
cells.push_back(pch);

... and then...

TCHAR *original_document = cells[column_holding_doc];
TCHAR *pc = _tcstok(original_document, " ");
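Put together, the pointer-based approach might look like the sketch below. It uses plain char and std::strtok as a portable stand-in for TCHAR and _tcstok, and both function names (tokenize_in_place, words_in_column) are invented for illustration. Note that strtok keeps hidden state, so the row is fully split into cells before any single cell is split into words.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes `buffer` in place (delimiters are
// overwritten with '\0') and returns a pointer to the start of each token.
std::vector<char*> tokenize_in_place(char* buffer, const char* delims) {
    std::vector<char*> tokens;
    for (char* tok = std::strtok(buffer, delims); tok != NULL;
         tok = std::strtok(NULL, delims)) {
        tokens.push_back(tok);  // store the address, not the first character
    }
    return tokens;
}

// Hypothetical helper: returns the space-separated words found in the
// given tab-separated column of `row`. `row` is taken by value because
// strtok modifies the buffer it scans.
std::vector<std::string> words_in_column(std::string row, std::size_t column) {
    std::vector<std::string> words;
    if (row.empty()) return words;

    // First pass: finish splitting the whole row into cells.
    std::vector<char*> cells = tokenize_in_place(&row[0], "\t");
    if (column >= cells.size()) return words;

    // Second pass: split the chosen cell. This is safe only because the
    // first strtok pass has already completed.
    std::vector<char*> parts = tokenize_in_place(cells[column], " ");
    for (std::size_t i = 0; i < parts.size(); ++i) {
        words.push_back(std::string(parts[i]));  // copy before `row` goes away
    }
    return words;
}
```

The pointers in cells and parts are only valid while row is alive, which is why the words are copied into std::string objects before returning.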
c++ visual-studio-2008 utf-8 access-violation